fix: always transfer metadata files to all nodes before weight loading

Move transfer_metadata_files() outside the conditional needs_transfer block so it always runs in multi-node setups. This ensures config.json, tokenizer files, and other metadata are present on all nodes before load_model() is called, regardless of download status.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
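
The control-flow change described in this commit message can be illustrated with a minimal sketch. Only the names `transfer_metadata_files`, `needs_transfer`, and `load_model` come from the message; the `Node` class, `transfer_weights`, `prepare_and_load`, and the file names are hypothetical stand-ins, so this shows the shape of the fix rather than the actual exo code.

```python
from dataclasses import dataclass, field

# Assumed example metadata set; the real list of files is not shown in the commit.
METADATA_FILES = {"config.json", "tokenizer.json"}


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node and the files it holds."""
    name: str
    files: set[str] = field(default_factory=set)


def transfer_metadata_files(nodes: list[Node]) -> None:
    """Copy the small metadata files to every node (simulated)."""
    for node in nodes:
        node.files |= METADATA_FILES


def transfer_weights(nodes: list[Node]) -> None:
    """Hypothetical weight transfer, only needed when weights are missing."""
    for node in nodes:
        node.files.add("model.safetensors")


def load_model(node: Node) -> str:
    """Fail if metadata is missing, mimicking a loader that needs config.json."""
    missing = METADATA_FILES - node.files
    if missing:
        raise FileNotFoundError(f"{node.name} is missing {sorted(missing)}")
    return f"model loaded on {node.name}"


def prepare_and_load(nodes: list[Node], needs_transfer: bool) -> list[str]:
    # The metadata transfer sits outside the needs_transfer conditional,
    # so it runs in every multi-node setup.
    if len(nodes) > 1:
        transfer_metadata_files(nodes)
    if needs_transfer:
        transfer_weights(nodes)
    return [load_model(node) for node in nodes]


if __name__ == "__main__":
    # Weights are already present, so needs_transfer is False,
    # yet the metadata still reaches both nodes before loading.
    cluster = [Node("a", {"model.safetensors"}), Node("b", {"model.safetensors"})]
    print(prepare_and_load(cluster, needs_transfer=False))
```

If the metadata transfer were nested under `if needs_transfer:`, the example cluster above would raise `FileNotFoundError` in `load_model`, which is the failure mode the commit message describes.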
feat: replace has_weight_files with state-based has_local_model + fix pipeline peak memory
2026-02-13 07:32:30 -05:00 · 2026-02-12 15:01:00 -08:00 · 2026-02-12 14:26:23 -08:00 · 2026-02-12 13:38:32 -08:00 · 2026-02-12 13:14:12 -08:00 · 2026-02-12 13:01:22 -08:00
52 changed files with 3977 additions and 768 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -194,3 +194,40 @@ GitHub's API doesn't support direct image upload for PR comments. Workaround:
   git push origin <branch>
   ```
   The images still render in the PR comment because they reference the permanent commit SHA.
+
+## Running exo Remotely via SSH (macOS mDNS)
+
+**CRITICAL: On macOS, mDNS multicast (used for peer discovery) only works when the process runs in a proper macOS user session.** Background processes started via `nohup ... &`, `screen`, or plain SSH commands will NOT send mDNS packets and nodes will never discover each other.
+
+### The Problem
+When you SSH into a Mac and run `nohup uv run exo &`, the process runs in a detached session without access to macOS multicast networking. The exo node will start but will never discover peers, even if they're on the same network.
+
+### The Solution: Use `open` with a `.command` wrapper
+
+Create a `.command` script that `open` will execute in the proper macOS GUI session context:
+
+```bash
+# 1. Create wrapper script on the remote machine
+ssh user@remote-mac "cat > /tmp/run_exo.command << 'SCRIPT'
+#!/bin/bash
+export PATH=/opt/homebrew/bin:\$HOME/.local/bin:\$PATH
+export EXO_LIBP2P_NAMESPACE=your-namespace  # must match across all nodes
+cd ~/path/to/exo
+exec uv run exo -vv 2>&1 | tee /tmp/exo.log
+SCRIPT
+chmod +x /tmp/run_exo.command"
+
+# 2. Launch it via `open` (runs in macOS GUI session with proper mDNS)
+ssh user@remote-mac "open /tmp/run_exo.command"
+
+# 3. Check logs
+ssh user@remote-mac "tail -f /tmp/exo.log"
+```
+
+### Key Details
+- **`EXO_LIBP2P_NAMESPACE`**: All nodes in a cluster MUST use the same namespace value. The EXO.app uses a build-specific namespace (check with `ps eww <pid> | grep NAMESPACE`). If mixing dev builds with EXO.app, set the dev build's namespace to match.
+- **`open *.command`**: This is the macOS equivalent of double-clicking the script in Finder. It runs in the user's GUI session with full network access.
+- **Do NOT use**: `nohup ... &`, `screen -dm`, `tmux new-session -d`, or `sshpass`. These all create detached sessions where mDNS won't work.
+- **Killing**: `ssh user@remote-mac "pkill -f 'python.*exo'"` works fine for stopping.
+- **Dashboard**: Must be built before running: `cd dashboard && npm install && npm run build && cd ..`. Node.js is at `/opt/homebrew/bin/node` on Apple Silicon Macs.
+- **Verifying cluster**: `curl -s http://localhost:52415/state | python3 -c "import json,sys; s=json.load(sys.stdin); print(len(s['topology']['nodes']), 'nodes')"` — should show 2+ nodes.
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -141,6 +141,12 @@ version = "0.3.9"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "76a2e8124351fda1ef8aaaa3bbd7ebbcb486bbcd4225aca0aa0d84bb2db8fecb"

+[[package]]
+name = "arrayvec"
+version = "0.7.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50"
+
 [[package]]
 name = "asn1-rs"
 version = "0.7.1"
@@ -298,6 +304,19 @@ version = "1.8.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "55248b47b0caf0546f7988906588779981c43bb1bc9d0c44087278f80cdb44ba"

+[[package]]
+name = "bigdecimal"
+version = "0.4.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "560f42649de9fa436b73517378a147ec21f6c997a546581df4b4b31677828934"
+dependencies = [
+ "autocfg",
+ "libm",
+ "num-bigint",
+ "num-integer",
+ "num-traits",
+]
+
 [[package]]
 name = "bimap"
 version = "0.6.3"
@@ -497,6 +516,15 @@ version = "0.4.3"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "2f421161cb492475f1661ddc9815a745a1c894592070661180fdec3d4872e9c3"

+[[package]]
+name = "convert_case"
+version = "0.10.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "633458d4ef8c78b72454de2d54fd6ab2e60f9e02be22f3c6104cdc8a4e0fceb9"
+dependencies = [
+ "unicode-segmentation",
+]
+
 [[package]]
 name = "core-foundation"
 version = "0.9.4"
@@ -673,6 +701,17 @@ dependencies = [
 "syn 2.0.111",
 ]

+[[package]]
+name = "delegate"
+version = "0.13.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "780eb241654bf097afb00fc5f054a09b687dad862e485fdcf8399bb056565370"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn 2.0.111",
+]
+
 [[package]]
 name = "der"
 version = "0.7.10"
@@ -707,6 +746,29 @@ dependencies = [
 "powerfmt",
 ]

+[[package]]
+name = "derive_more"
+version = "2.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "10b768e943bed7bf2cab53df09f4bc34bfd217cdb57d971e769874c9a6710618"
+dependencies = [
+ "derive_more-impl",
+]
+
+[[package]]
+name = "derive_more-impl"
+version = "2.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6d286bfdaf75e988b4a78e013ecd79c581e06399ab53fbacd2d916c2f904f30b"
+dependencies = [
+ "convert_case",
+ "proc-macro2",
+ "quote",
+ "rustc_version",
+ "syn 2.0.111",
+ "unicode-xid",
+]
+
 [[package]]
 name = "digest"
 version = "0.10.7"
@@ -876,23 +938,37 @@ dependencies = [
 name = "exo_pyo3_bindings"
 version = "0.0.1"
 dependencies = [
+ "delegate",
+ "derive_more",
 "env_logger",
- "futures-lite",
+ "extend",
+ "futures",
+ "impl-trait-for-tuples",
 "libp2p",
 "log",
 "networking",
+ "once_cell",
+ "pin-project",
 "pyo3",
 "pyo3-async-runtimes",
 "pyo3-log",
 "pyo3-stub-gen",
+ "thiserror 2.0.17",
+ "thread_local",
 "tokio",
+ "util",
 ]

 [[package]]
-name = "fastrand"
-version = "2.3.0"
+name = "extend"
+version = "1.2.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be"
+checksum = "311a6d2f1f9d60bff73d2c78a0af97ed27f79672f15c238192a5bbb64db56d00"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn 2.0.111",
+]

 [[package]]
 name = "ff"
@@ -1002,10 +1078,7 @@ version = "2.6.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "f78e10609fe0e0b3f4157ffab1876319b5b0db102a2c60dc4626306dc46b44ad"
 dependencies = [
- "fastrand",
 "futures-core",
- "futures-io",
- "parking",
 "pin-project-lite",
 ]

@@ -1567,6 +1640,17 @@ dependencies = [
 "xmltree",
 ]

+[[package]]
+name = "impl-trait-for-tuples"
+version = "0.2.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a0eb5a3343abf848c0984fe4604b2b105da9539376e24fc0a3b0007411ae4fd9"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn 2.0.111",
+]
+
 [[package]]
 name = "indexmap"
 version = "2.12.1"
@@ -1721,6 +1805,12 @@ dependencies = [
 "cpufeatures",
 ]

+[[package]]
+name = "keccak-const"
+version = "0.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "57d8d8ce877200136358e0bbff3a77965875db3af755a11e1fa6b1b3e2df13ea"
+
 [[package]]
 name = "lalrpop-util"
 version = "0.20.2"
@@ -1739,6 +1829,12 @@ version = "0.2.178"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "37c93d8daa9d8a012fd8ab92f088405fb202ea0b6ab73ee2482ae66af4f42091"

+[[package]]
+name = "libm"
+version = "0.2.15"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f9fbbcab51052fe104eb5e5d351cf728d30a5be1fe14d9be8a3b097481fb97de"
+
 [[package]]
 name = "libp2p"
 version = "0.56.0"
@@ -2727,10 +2823,20 @@ dependencies = [
 name = "networking"
 version = "0.0.1"
 dependencies = [
+ "delegate",
+ "derive_more",
+ "either",
+ "extend",
+ "futures",
+ "futures-timer",
+ "impl-trait-for-tuples",
+ "keccak-const",
 "libp2p",
 "log",
+ "thiserror 2.0.17",
 "tokio",
 "tracing-subscriber",
+ "util",
 ]

 [[package]]
@@ -2812,6 +2918,17 @@ dependencies = [
 "num-traits",
 ]

+[[package]]
+name = "num-rational"
+version = "0.4.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f83d14da390562dca69fc84082e73e548e1ad308d24accdedd2720017cb37824"
+dependencies = [
+ "num-bigint",
+ "num-integer",
+ "num-traits",
+]
+
 [[package]]
 name = "num-traits"
 version = "0.2.19"
@@ -3162,14 +3279,28 @@ version = "0.27.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "ab53c047fcd1a1d2a8820fe84f05d6be69e9526be40cb03b73f86b6b03e6d87d"
 dependencies = [
+ "bigdecimal",
+ "either",
+ "hashbrown 0.16.1",
+ "indexmap",
 "indoc",
+ "inventory",
 "libc",
+ "lock_api",
 "memoffset",
+ "num-bigint",
+ "num-complex",
+ "num-rational",
+ "num-traits",
 "once_cell",
+ "ordered-float",
+ "parking_lot",
 "portable-atomic",
 "pyo3-build-config",
 "pyo3-ffi",
 "pyo3-macros",
+ "rust_decimal",
+ "smallvec",
 "unindent",
 ]

@@ -3610,6 +3741,16 @@ dependencies = [
 "tokio",
 ]

+[[package]]
+name = "rust_decimal"
+version = "1.39.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "35affe401787a9bd846712274d97654355d21b2a2c092a3139aabe31e9022282"
+dependencies = [
+ "arrayvec",
+ "num-traits",
+]
+
 [[package]]
 name = "rustc-hash"
 version = "1.1.0"
@@ -4474,12 +4615,24 @@ version = "1.0.22"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "9312f7c4f6ff9069b165498234ce8be658059c6728633667c526e27dc2cf1df5"

+[[package]]
+name = "unicode-segmentation"
+version = "1.12.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f6ccf251212114b54433ec949fd6a7841275f9ada20dddd2f29e9ceea4501493"
+
 [[package]]
 name = "unicode-width"
 version = "0.2.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254"

+[[package]]
+name = "unicode-xid"
+version = "0.2.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853"
+
 [[package]]
 name = "unicode_names2"
 version = "1.3.0"
@@ -4560,6 +4713,10 @@ version = "0.2.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "06abde3611657adf66d383f00b093d7faecc7fa57071cce2578660c9f1010821"

+[[package]]
+name = "util"
+version = "0.0.1"
+
 [[package]]
 name = "uuid"
 version = "1.19.0"
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -3,6 +3,7 @@ resolver = "3"
 members = [
    "rust/networking",
    "rust/exo_pyo3_bindings",
+    "rust/util",
 ]

 [workspace.package]
@@ -23,18 +24,62 @@ opt-level = 3
 [workspace.dependencies]
 ## Crate members as common dependencies
 networking = { path = "rust/networking" }
+util = { path = "rust/util" }
+
+# Proc-macro authoring tools
+syn = "2.0"
+quote = "1.0"
+proc-macro2 = "1.0"
+darling = "0.20"
+
+# Macro dependencies
+extend = "1.2"
+delegate = "0.13"
+impl-trait-for-tuples = "0.2"
+clap = "4.5"
+derive_more = { version = "2.0.1", features = ["display"] }
+pin-project = "1"
+
+# Utility dependencies
+itertools = "0.14"
+thiserror = "2"
+internment = "0.8"
+recursion = "0.5"
+regex = "1.11"
+once_cell = "1.21"
+thread_local = "1.1"
+bon = "3.4"
+generativity = "1.1"
+anyhow = "1.0"
+keccak-const = "0.2"
+
+# Functional generics/lenses frameworks
+frunk_core = "0.4"
+frunk = "0.4"
+frunk_utils = "0.2"
+frunk-enum-core = "0.3"

 # Async dependencies
 tokio = "1.46"
+futures = "0.3"
+futures-util = "0.3"
+futures-timer = "3.0"
+
+# Data structures
+either = "1.15"
+ordered-float = "5.0"
+ahash = "0.8"

 # Tracing/logging
 log = "0.4"

 # networking
 libp2p = "0.56"
+libp2p-tcp = "0.44"

 [workspace.lints.rust]
-static_mut_refs = "warn"
+static_mut_refs = "warn"      # Or use "warn" instead of deny
+incomplete_features = "allow"

 # Clippy's lint category level configurations;
 # every member crate needs to inherit these by adding
@@ -55,3 +100,64 @@ perf = { level = "warn", priority = -1 }
 pedantic = { level = "warn", priority = -1 }
 nursery = { level = "warn", priority = -1 }
 cargo = { level = "warn", priority = -1 }
+
+# Individual Clippy lints from the `restriction` category
+arithmetic_side_effects = "warn"
+as_conversions = "warn"
+assertions_on_result_states = "warn"
+clone_on_ref_ptr = "warn"
+decimal_literal_representation = "warn"
+default_union_representation = "warn"
+deref_by_slicing = "warn"
+disallowed_script_idents = "deny"
+else_if_without_else = "warn"
+empty_enum_variants_with_brackets = "warn"
+empty_structs_with_brackets = "warn"
+error_impl_error = "warn"
+exit = "deny"
+expect_used = "warn"
+float_cmp_const = "warn"
+get_unwrap = "warn"
+if_then_some_else_none = "warn"
+impl_trait_in_params = "warn"
+indexing_slicing = "warn"
+infinite_loop = "warn"
+let_underscore_must_use = "warn"
+let_underscore_untyped = "warn"
+lossy_float_literal = "warn"
+mem_forget = "warn"
+missing_inline_in_public_items = "warn"
+multiple_inherent_impl = "warn"
+multiple_unsafe_ops_per_block = "warn"
+mutex_atomic = "warn"
+non_zero_suggestions = "warn"
+panic = "warn"
+partial_pub_fields = "warn"
+pattern_type_mismatch = "warn"
+pub_without_shorthand = "warn"
+rc_buffer = "warn"
+rc_mutex = "warn"
+redundant_type_annotations = "warn"
+renamed_function_params = "warn"
+rest_pat_in_fully_bound_structs = "warn"
+same_name_method = "warn"
+self_named_module_files = "deny"
+semicolon_inside_block = "warn"
+shadow_same = "warn"
+shadow_unrelated = "warn"
+str_to_string = "warn"
+string_add = "warn"
+string_lit_chars_any = "warn"
+string_to_string = "warn"
+tests_outside_test_module = "warn"
+todo = "warn"
+try_err = "warn"
+undocumented_unsafe_blocks = "warn"
+unnecessary_safety_comment = "warn"
+unnecessary_safety_doc = "warn"
+unneeded_field_pattern = "warn"
+unseparated_literal_suffix = "warn"
+unused_result_ok = "warn"
+unused_trait_names = "warn"
+unwrap_used = "warn"
+verbose_file_reads = "warn"
--- a/MISSED_THINGS.md
+++ b/MISSED_THINGS.md
@@ -1,5 +1,5 @@
 # Missed things
-[X] Log namespace on start in exo/main.py
+[X] Log EXO_LIBP2P_NAMESPACE on start in exo/main.py
 [X] Ordering of warmup was changed, which is wrong. It was changed to rank < n-1, then rank=n-1. It should be rank!=0 then rank=0 (this matches the auto_parallel implementation. NOTE: we use a different convention to mlx-lm, our terminal rank is rank=n-1 whereas mlx-lm is rank=0 hence i can see why this was changed wrongly).
 [X] Downloads keying by model_id not shard_metadata (worker/plan.py, worker/main.py).
 [X] Fetching download status of all models on start
--- a/README.md
+++ b/README.md
@@ -199,14 +199,14 @@ The app will ask for permission to modify system settings and install a new Netw

 **Custom Namespace for Cluster Isolation:**

-The macOS app includes a custom namespace feature that allows you to isolate your exo cluster from others on the same network. This is configured through the `--namespace` cli arg:
+The macOS app includes a custom namespace feature that allows you to isolate your exo cluster from others on the same network. This is configured through the `EXO_LIBP2P_NAMESPACE` setting:

 - **Use cases**:
  - Running multiple separate exo clusters on the same network
  - Isolating development/testing clusters from production clusters
  - Preventing accidental cluster joining

- **Configuration**: Access this setting in the app's Advanced settings (or set the `--namespace` argument when running from source)
+- **Configuration**: Access this setting in the app's Advanced settings (or set the `EXO_LIBP2P_NAMESPACE` environment variable when running from source)

 The namespace is logged on startup for debugging purposes.

@@ -418,4 +418,4 @@ On macOS, exo uses the GPU. On Linux, exo currently runs on CPU. We are working

 ## Contributing

-See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on how to contribute to exo.
+See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on how to contribute to exo.
--- a/app/EXO/EXO/ExoProcessController.swift
+++ b/app/EXO/EXO/ExoProcessController.swift
@@ -82,7 +82,6 @@ final class ExoProcessController: ObservableObject {

            let child = Process()
            child.executableURL = executableURL
-            child.arguments = ["--namespace", computeNamespace()]
            let exoHomeURL = Self.exoDirectoryURL
            try? FileManager.default.createDirectory(
                at: exoHomeURL, withIntermediateDirectories: true
@@ -217,6 +216,7 @@ final class ExoProcessController: ObservableObject {
    private func makeEnvironment(for runtimeURL: URL) -> [String: String] {
        var environment = ProcessInfo.processInfo.environment
        environment["EXO_RUNTIME_DIR"] = runtimeURL.path
+        environment["EXO_LIBP2P_NAMESPACE"] = computeNamespace()
        if !hfToken.isEmpty {
            environment["HF_TOKEN"] = hfToken
        }
--- a/bench/exo_bench.py
+++ b/bench/exo_bench.py
@@ -19,6 +19,11 @@ from urllib.parse import urlencode
 from loguru import logger
 from transformers import AutoTokenizer

+# Backoff constants for cluster settling retry
+_SETTLE_INITIAL_BACKOFF_S = 1.0
+_SETTLE_MAX_BACKOFF_S = 60.0
+_SETTLE_BACKOFF_MULTIPLIER = 2.0
+
 # Monkey-patch for transformers 5.x compatibility
 # Kimi's tokenization_kimi.py imports bytes_to_unicode from the old location
 # which was moved in transformers 5.0.0rc2
@@ -388,6 +393,66 @@ class PromptSizer:
        return content, tok


+def fetch_and_filter_placements(
+    client: ExoClient, full_model_id: str, args: argparse.Namespace
+) -> list[dict[str, Any]]:
+    previews_resp = client.request_json(
+        "GET", "/instance/previews", params={"model_id": full_model_id}
+    )
+    previews = previews_resp.get("previews") or []
+
+    selected: list[dict[str, Any]] = []
+    for p in previews:
+        if p.get("error") is not None:
+            continue
+        if not placement_filter(str(p.get("instance_meta", "")), args.instance_meta):
+            continue
+        if not sharding_filter(str(p.get("sharding", "")), args.sharding):
+            continue
+
+        instance = p.get("instance")
+        if not isinstance(instance, dict):
+            continue
+
+        n = nodes_used_in_instance(instance)
+        # Skip tensor ring single node as it is pointless when pipeline ring
+        if n == 1 and (
+            (args.sharding == "both" and "tensor" in p.get("sharding", "").lower())
+            or (
+                args.instance_meta == "both"
+                and "jaccl" in p.get("instance_meta", "").lower()
+            )
+        ):
+            continue
+
+        if (
+            args.skip_pipeline_jaccl
+            and (
+                args.instance_meta == "both"
+                and "jaccl" in p.get("instance_meta", "").lower()
+            )
+            and (
+                args.sharding == "both" and "pipeline" in p.get("sharding", "").lower()
+            )
+        ):
+            continue
+
+        if (
+            args.skip_tensor_ring
+            and (
+                args.instance_meta == "both"
+                and "ring" in p.get("instance_meta", "").lower()
+            )
+            and (args.sharding == "both" and "tensor" in p.get("sharding", "").lower())
+        ):
+            continue
+
+        if args.min_nodes <= n <= args.max_nodes:
+            selected.append(p)
+
+    return selected
+
+
 def main() -> int:
    ap = argparse.ArgumentParser(
        prog="exo-bench",
@@ -464,6 +529,12 @@ def main() -> int:
        action="store_true",
        help="Force all pp×tg combinations (cartesian product) even when lists have equal length.",
    )
+    ap.add_argument(
+        "--settle-timeout",
+        type=float,
+        default=0,
+        help="Max seconds to wait for the cluster to produce valid placements (0 = try once).",
+    )
    args = ap.parse_args()

    pp_list = parse_int_list(args.pp)
@@ -487,11 +558,6 @@ def main() -> int:
    client = ExoClient(args.host, args.port, timeout_s=args.timeout)
    short_id, full_model_id = resolve_model_short_id(client, args.model)

-    previews_resp = client.request_json(
-        "GET", "/instance/previews", params={"model_id": full_model_id}
-    )
-    previews = previews_resp.get("previews") or []
-
    tokenizer = load_tokenizer_for_bench(full_model_id)
    if tokenizer is None:
        raise RuntimeError("[exo-bench] tokenizer load failed")
@@ -503,54 +569,20 @@ def main() -> int:
        logger.error("[exo-bench] tokenizer usable but prompt sizing failed")
        raise

-    selected: list[dict[str, Any]] = []
-    for p in previews:
-        if p.get("error") is not None:
-            continue
-        if not placement_filter(str(p.get("instance_meta", "")), args.instance_meta):
-            continue
-        if not sharding_filter(str(p.get("sharding", "")), args.sharding):
-            continue
+    selected = fetch_and_filter_placements(client, full_model_id, args)

-        instance = p.get("instance")
-        if not isinstance(instance, dict):
-            continue
-
-        n = nodes_used_in_instance(instance)
-        # Skip tensor ring single node as it is pointless when pipeline ring
-        if n == 1 and (
-            (args.sharding == "both" and "tensor" in p.get("sharding", "").lower())
-            or (
-                args.instance_meta == "both"
-                and "jaccl" in p.get("instance_meta", "").lower()
+    if not selected and args.settle_timeout > 0:
+        backoff = _SETTLE_INITIAL_BACKOFF_S
+        deadline = time.monotonic() + args.settle_timeout
+        while not selected and time.monotonic() < deadline:
+            remaining = deadline - time.monotonic()
+            logger.warning(
+                f"No valid placements yet (cluster may still be settling). "
+                f"Retrying in {backoff:.1f}s ({remaining:.0f}s remaining)..."
            )
-        ):
-            continue
-
-        if (
-            args.skip_pipeline_jaccl
-            and (
-                args.instance_meta == "both"
-                and "jaccl" in p.get("instance_meta", "").lower()
-            )
-            and (
-                args.sharding == "both" and "pipeline" in p.get("sharding", "").lower()
-            )
-        ):
-            continue
-
-        if (
-            args.skip_tensor_ring
-            and (
-                args.instance_meta == "both"
-                and "ring" in p.get("instance_meta", "").lower()
-            )
-            and (args.sharding == "both" and "tensor" in p.get("sharding", "").lower())
-        ):
-            continue
-
-        if args.min_nodes <= n <= args.max_nodes:
-            selected.append(p)
+            time.sleep(min(backoff, remaining))
+            backoff = min(backoff * _SETTLE_BACKOFF_MULTIPLIER, _SETTLE_MAX_BACKOFF_S)
+            selected = fetch_and_filter_placements(client, full_model_id, args)

    if not selected:
        logger.error("No valid placements matched your filters.")
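
As a reading aid for the `--settle-timeout` retry loop in the hunk above, the same capped exponential backoff pattern can be sketched in isolation. The function name `retry_until_nonempty` and the injectable `sleep` and `clock` parameters are assumptions made for illustration and testability; the defaults mirror the `_SETTLE_*` constants from the diff. The clamp to zero before sleeping is an extra guard that is not in the original loop.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_nonempty(
    fetch: Callable[[], list[T]],
    timeout_s: float,
    initial_backoff_s: float = 1.0,
    max_backoff_s: float = 60.0,
    multiplier: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list[T]:
    """Call fetch() until it returns a non-empty list or the deadline passes."""
    result = fetch()
    if result or timeout_s <= 0:
        # timeout_s == 0 means "try once", matching the flag's help text.
        return result

    backoff = initial_backoff_s
    deadline = clock() + timeout_s
    while not result and clock() < deadline:
        remaining = deadline - clock()
        # Never sleep past the deadline; clamp at 0 to avoid a negative sleep.
        sleep(max(0.0, min(backoff, remaining)))
        # Grow the delay geometrically, capped at max_backoff_s.
        backoff = min(backoff * multiplier, max_backoff_s)
        result = fetch()
    return result


if __name__ == "__main__":
    attempts = {"n": 0}

    def fake_fetch() -> list[str]:
        # Hypothetical fetch that succeeds on the third attempt.
        attempts["n"] += 1
        return ["placement"] if attempts["n"] >= 3 else []

    print(retry_until_nonempty(fake_fetch, timeout_s=10.0, initial_backoff_s=0.01))
```

Using a monotonic clock for the deadline, as the diff does, keeps the timeout unaffected by wall-clock adjustments while the loop is waiting.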
--- a/rust/clippy.toml
+++ b/rust/clippy.toml
@@ -0,0 +1,2 @@
+# we can manually exclude false-positive lint errors for dual packages (if in dependencies)
+#allowed-duplicate-crates = ["hashbrown"]
--- a/rust/exo_pyo3_bindings/Cargo.toml
+++ b/rust/exo_pyo3_bindings/Cargo.toml
@@ -25,26 +25,44 @@ workspace = true
 networking = { workspace = true }

 # interop
-pyo3 = { version = "0.27.2", features = [
-    "abi3-py313", # tells pyo3 (and maturin) to build using the stable ABI with minimum Python version 3.13
-    # "nightly", # enables better-supported GIL integration
-    "experimental-async" # async support in #[pyfunction] & #[pymethods]
-    # "experimental-inspect", # inspection of generated binary => easier to automate type-hint generation
-    # "py-clone", # adding Clone-ing of `Py<T>` without GIL (may cause panics - remove if panics happen)
-    # "multiple-pymethods", # allows multiple #[pymethods] sections per class
+pyo3 = { version = "0.27.1", features = [
+    # "abi3-py311", # tells pyo3 (and maturin) to build using the stable ABI with minimum Python version 3.11
+    "nightly", # enables better-supported GIL integration
+    "experimental-async", # async support in #[pyfunction] & #[pymethods]
+    #"experimental-inspect", # inspection of generated binary => easier to automate type-hint generation
+    #"py-clone", # adding Clone-ing of `Py<T>` without GIL (may cause panics - remove if panics happen)
+    "multiple-pymethods", # allows multiple #[pymethods] sections per class

    # integrations with other libraries
-    # "arc_lock", "bigdecimal", "either", "hashbrown", "indexmap", "num-bigint", "num-complex", "num-rational",
-    # "ordered-float", "rust_decimal", "smallvec",
+    "arc_lock", "bigdecimal", "either", "hashbrown", "indexmap", "num-bigint", "num-complex", "num-rational",
+    "ordered-float", "rust_decimal", "smallvec",
    # "anyhow", "chrono", "chrono-local", "chrono-tz", "eyre", "jiff-02", "lock_api", "parking-lot", "time",  "serde",
 ] }
 pyo3-stub-gen = { version = "0.17.2" }
 pyo3-async-runtimes = { version = "0.27.0", features = ["attributes", "tokio-runtime", "testing"] }
 pyo3-log = "0.13.2"

+# macro dependencies
+extend = { workspace = true }
+delegate = { workspace = true }
+impl-trait-for-tuples = { workspace = true }
+derive_more = { workspace = true }
+pin-project = { workspace = true }
+
 # async runtime
 tokio = { workspace = true, features = ["full", "tracing"] }
-futures-lite = "2.6.1"
+futures = { workspace = true }
+
+# utility dependencies
+once_cell = "1.21.3"
+thread_local = "1.1.9"
+util = { workspace = true }
+thiserror = { workspace = true }
+#internment = { workspace = true }
+#recursion = { workspace = true }
+#generativity = { workspace = true }
+#itertools = { workspace = true }
+

 # Tracing
 #tracing = "0.1"
--- a/rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi
+++ b/rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi
@@ -2,39 +2,220 @@
 # ruff: noqa: E501, F401

 import builtins
+import enum
 import typing

@typing.final
-class Keypair:
-    @staticmethod
-    def generate() -> Keypair:
+class AllQueuesFullError(builtins.Exception):
+    def __new__(cls, *args: typing.Any) -> AllQueuesFullError: ...
+    def __repr__(self) -> builtins.str: ...
+    def __str__(self) -> builtins.str: ...
+
+@typing.final
+class ConnectionUpdate:
+    @property
+    def update_type(self) -> ConnectionUpdateType:
        r"""
-        Generate a new ed25519 keypair
+        Whether this is a connection or disconnection event
+        """
+    @property
+    def peer_id(self) -> PeerId:
+        r"""
+        Identity of the peer that we have connected to or disconnected from.
+        """
+    @property
+    def remote_ipv4(self) -> builtins.str:
+        r"""
+        Remote connection's IPv4 address.
+        """
+    @property
+    def remote_tcp_port(self) -> builtins.int:
+        r"""
+        Remote connection's TCP port.
+        """
+
+@typing.final
+class Keypair:
+    r"""
+    Identity keypair of a node.
+    """
+    @staticmethod
+    def generate_ed25519() -> Keypair:
+        r"""
+        Generate a new Ed25519 keypair.
+        """
+    @staticmethod
+    def generate_ecdsa() -> Keypair:
+        r"""
+        Generate a new ECDSA keypair.
+        """
+    @staticmethod
+    def generate_secp256k1() -> Keypair:
+        r"""
+        Generate a new Secp256k1 keypair.
        """
    @staticmethod
    def from_protobuf_encoding(bytes: bytes) -> Keypair:
        r"""
        Decode a private key from a protobuf structure and parse it as a `Keypair`.
        """
+    @staticmethod
+    def rsa_from_pkcs8(bytes: bytes) -> Keypair:
+        r"""
+        Decode a keypair from a DER-encoded secret key in PKCS#8 `PrivateKeyInfo`
+        format (i.e. unencrypted) as defined in [RFC5208].
+        
+        [RFC5208]: https://tools.ietf.org/html/rfc5208#section-5
+        """
+    @staticmethod
+    def secp256k1_from_der(bytes: bytes) -> Keypair:
+        r"""
+        Decode a keypair from a DER-encoded Secp256k1 secret key in an `ECPrivateKey`
+        structure as defined in [RFC5915].
+        
+        [RFC5915]: https://tools.ietf.org/html/rfc5915
+        """
+    @staticmethod
+    def ed25519_from_bytes(bytes: bytes) -> Keypair: ...
    def to_protobuf_encoding(self) -> bytes:
        r"""
-        Encode a private key to a protobuf structure.
+        Encode a private key as a protobuf structure.
+        """
+    def to_peer_id(self) -> PeerId:
+        r"""
+        Convert the `Keypair` into the corresponding `PeerId`.
        """
-    def to_string(self) -> builtins.str: ...

@typing.final
-class PyPeer:
+class Multiaddr:
+    r"""
+    Representation of a Multiaddr.
+    """
    @staticmethod
-    def new(kp: Keypair, namespace: builtins.str) -> PyPeer: ...
-    async def subscribe(self, topic: builtins.str) -> None: ...
-    async def unsubscribe(self, topic: builtins.str) -> None: ...
-    async def send(self, topic: builtins.str, payload: bytes) -> None: ...
-    async def run(self) -> None: ...
-    async def recv(self) -> PySwarmEvent: ...
+    def empty() -> Multiaddr:
+        r"""
+        Create a new, empty multiaddress.
+        """
+    @staticmethod
+    def with_capacity(n: builtins.int) -> Multiaddr:
+        r"""
+        Create a new, empty multiaddress with the given capacity.
+        """
+    @staticmethod
+    def from_bytes(bytes: bytes) -> Multiaddr:
+        r"""
+        Parse a `Multiaddr` value from its byte slice representation.
+        """
+    @staticmethod
+    def from_string(string: builtins.str) -> Multiaddr:
+        r"""
+        Parse a `Multiaddr` value from its string representation.
+        """
+    def len(self) -> builtins.int:
+        r"""
+        Return the length in bytes of this multiaddress.
+        """
+    def is_empty(self) -> builtins.bool:
+        r"""
+        Returns true if the length of this multiaddress is 0.
+        """
+    def to_bytes(self) -> bytes:
+        r"""
+        Return a copy of this [`Multiaddr`]'s byte representation.
+        """
+    def to_string(self) -> builtins.str:
+        r"""
+        Convert a Multiaddr to a string.
+        """

@typing.final
-class PySwarmEvent:
-    def downcast_discovered(self) -> typing.Optional[builtins.str]: ...
-    def downcast_expired(self) -> typing.Optional[builtins.str]: ...
-    def downcast_message(self) -> typing.Optional[tuple[builtins.str, builtins.str, bytes]]: ...
+class NetworkingHandle:
+    def __new__(cls, identity: Keypair) -> NetworkingHandle: ...
+    async def connection_update_recv(self) -> ConnectionUpdate:
+        r"""
+        Receives the next `ConnectionUpdate` from networking.
+        """
+    async def connection_update_recv_many(self, limit: builtins.int) -> builtins.list[ConnectionUpdate]:
+        r"""
+        Receives at most `limit` `ConnectionUpdate`s from networking and returns them.
+        
+        For `limit = 0`, an empty collection of `ConnectionUpdate`s will be returned immediately.
+        For `limit > 0`, if there are no `ConnectionUpdate`s in the channel's queue this method
+        will sleep until a `ConnectionUpdate` is sent.
+        """
+    async def gossipsub_subscribe(self, topic: builtins.str) -> builtins.bool:
+        r"""
+        Subscribe to a `GossipSub` topic.
+        
+        Returns `True` if the subscription worked. Returns `False` if we were already subscribed.
+        """
+    async def gossipsub_unsubscribe(self, topic: builtins.str) -> builtins.bool:
+        r"""
+        Unsubscribes from a `GossipSub` topic.
+        
+        Returns `True` if we were subscribed to this topic. Returns `False` if we were not subscribed.
+        """
+    async def gossipsub_publish(self, topic: builtins.str, data: bytes) -> None:
+        r"""
+        Publishes a message on the given topic to the `GossipSub` network.
+        
+        If no peers are subscribed to this topic, raises a `NoPeersSubscribedToTopicError` exception.
+        """
+    async def gossipsub_recv(self) -> tuple[builtins.str, bytes]:
+        r"""
+        Receives the next message from the `GossipSub` network.
+        """
+    async def gossipsub_recv_many(self, limit: builtins.int) -> builtins.list[tuple[builtins.str, bytes]]:
+        r"""
+        Receives at most `limit` messages from the `GossipSub` network and returns them.
+        
+        For `limit = 0`, an empty collection of messages will be returned immediately.
+        For `limit > 0`, if there are no messages in the channel's queue this method
+        will sleep until a message is sent.
+        """
+
+@typing.final
+class NoPeersSubscribedToTopicError(builtins.Exception):
+    def __new__(cls, *args: typing.Any) -> NoPeersSubscribedToTopicError: ...
+    def __repr__(self) -> builtins.str: ...
+    def __str__(self) -> builtins.str: ...
+
+@typing.final
+class PeerId:
+    r"""
+    Identifier of a peer of the network.
+    
+    The data is a `CIDv0` compatible multihash of the protobuf encoded public key of the peer
+    as specified in [specs/peer-ids](https://github.com/libp2p/specs/blob/master/peer-ids/peer-ids.md).
+    """
+    @staticmethod
+    def random() -> PeerId:
+        r"""
+        Generates a random peer ID from a cryptographically secure PRNG.
+        
+        This is useful for randomly walking on a DHT, or for testing purposes.
+        """
+    @staticmethod
+    def from_bytes(bytes: bytes) -> PeerId:
+        r"""
+        Parses a `PeerId` from bytes.
+        """
+    def to_bytes(self) -> bytes:
+        r"""
+        Returns a raw bytes representation of this `PeerId`.
+        """
+    def to_base58(self) -> builtins.str:
+        r"""
+        Returns a base-58 encoded string of this `PeerId`.
+        """
+    def __repr__(self) -> builtins.str: ...
+    def __str__(self) -> builtins.str: ...
+
+@typing.final
+class ConnectionUpdateType(enum.Enum):
+    r"""
+    Connection or disconnection event discriminant type.
+    """
+    Connected = ...
+    Disconnected = ...
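
The batching contract documented for `connection_update_recv_many` and `gossipsub_recv_many` can be illustrated with a minimal Python sketch. This is not the Rust implementation: the `recv_many` helper and the `asyncio.Queue` stand-in are assumptions made for illustration, and the closed-channel error path of the real bindings is not modelled.

```python
import asyncio


async def recv_many(queue: asyncio.Queue, limit: int) -> list:
    """Return at most `limit` queued items, waiting only for the first one.

    Hypothetical stand-in for the documented `*_recv_many` semantics.
    """
    if limit <= 0:
        # limit = 0: return an empty batch immediately, without waiting
        return []
    # limit > 0: sleep until at least one item is available...
    batch = [await queue.get()]
    # ...then take whatever else is already queued, up to `limit`
    while len(batch) < limit and not queue.empty():
        batch.append(queue.get_nowait())
    return batch


async def demo() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(i)
    empty = await recv_many(queue, 0)   # nothing requested
    first = await recv_many(queue, 3)   # capped by `limit`
    rest = await recv_many(queue, 10)   # capped by what is queued
    return empty, first, rest


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A caller polling the real handle would presumably loop on `recv_many` with a fixed `limit` and process each batch, which amortises the cost of crossing the Rust/Python boundary.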

--- a/rust/exo_pyo3_bindings/src/allow_threading.rs
+++ b/rust/exo_pyo3_bindings/src/allow_threading.rs
@@ -1,4 +1,8 @@
-//! See: <https://pyo3.rs/v0.27.2/async-await.html#detaching-from-the-interpreter-across-await>
+//! SEE: https://pyo3.rs/v0.26.0/async-await.html#detaching-from-the-interpreter-across-await
+//!
+
+use pin_project::pin_project;
+use pyo3::marker::Ungil;
 use pyo3::prelude::*;
 use std::{
    future::Future,
@@ -6,17 +10,31 @@ use std::{
    task::{Context, Poll},
 };

-pub struct AllowThreads<F>(pub(crate) F);
+/// SEE: https://pyo3.rs/v0.26.0/async-await.html#detaching-from-the-interpreter-across-await
+#[pin_project]
+#[repr(transparent)]
+pub(crate) struct AllowThreads<F>(#[pin] F);
+
+impl<F> AllowThreads<F>
+where
+    Self: Future,
+{
+    pub fn new(f: F) -> Self {
+        Self(f)
+    }
+}

 impl<F> Future for AllowThreads<F>
 where
-    F: Future + Unpin + Send,
-    F::Output: Send,
+    F: Future + Ungil,
+    F::Output: Ungil,
 {
    type Output = F::Output;

-    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
+    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let waker = cx.waker();
-        Python::attach(|py| py.detach(|| pin!(&mut self.0).poll(&mut Context::from_waker(waker))))
+        Python::with_gil(|py| {
+            py.allow_threads(|| self.project().0.poll(&mut Context::from_waker(waker)))
+        })
    }
 }
--- a/rust/exo_pyo3_bindings/src/examples/mod.rs
+++ b/rust/exo_pyo3_bindings/src/examples/mod.rs
@@ -0,0 +1,240 @@
+//! This module exists to hold examples of some pyo3 patterns that may be too complex to
+//! re-create from scratch, but too inhomogeneous to create an abstraction/wrapper around.
+//!
+//! Pattern examples include:
+//!  - Async task handles: with GC-integrated cleanup
+//!  - Sync/async callbacks from python: with proper event-loop handling
+//!
+//! Mutability pattern: https://pyo3.rs/v0.26.0/async-await.html#send--static-constraint
+//!  - Store mutable fields in tokio's `Mutex<T>`
+//!  - For async code: take `&self` and `.lock().await`
+//!  - For sync code: take `&mut self` and `.get_mut()`
+
+use crate::ext::{PyResultExt as _, ResultExt as _, TokioRuntimeExt as _};
+use futures::FutureExt as _;
+use futures::future::BoxFuture;
+use pyo3::exceptions::PyRuntimeError;
+use pyo3::prelude::{PyModule, PyModuleMethods as _};
+use pyo3::{
+    Bound, Py, PyAny, PyErr, PyResult, PyTraverseError, PyVisit, Python, pyclass, pymethods,
+};
+use std::time::Duration;
+use tokio::sync::mpsc;
+use tokio::sync::mpsc::error::TryRecvError;
+
+fn needs_tokio_runtime() {
+    tokio::runtime::Handle::current();
+}
+
+type SyncCallback = Box<dyn Fn() + Send + Sync>;
+type AsyncCallback = Box<dyn Fn() -> BoxFuture<'static, ()> + Send + Sync>;
+
+enum AsyncTaskMessage {
+    SyncCallback(SyncCallback),
+    AsyncCallback(AsyncCallback),
+}
+
+async fn async_task(
+    sender: mpsc::UnboundedSender<()>,
+    mut receiver: mpsc::UnboundedReceiver<AsyncTaskMessage>,
+) {
+    log::info!("RUST: async task started");
+
+    // task state
+    let mut interval = tokio::time::interval(Duration::from_secs(1));
+
+    let mut sync_cbs: Vec<SyncCallback> = vec![];
+    let mut async_cbs: Vec<AsyncCallback> = vec![];
+
+    loop {
+        tokio::select! {
+            // handle incoming messages from task-handle
+            message = receiver.recv() => {
+                // handle closed channel by exiting
+                let Some(message) = message else {
+                    log::info!("RUST: channel closed");
+                    break;
+                };
+
+                // dispatch incoming event
+                match message {
+                    AsyncTaskMessage::SyncCallback(cb) => {
+                        sync_cbs.push(cb);
+                    }
+                    AsyncTaskMessage::AsyncCallback(cb) => {
+                        async_cbs.push(cb);
+                    }
+                }
+            }
+
+            // handle all other events
+            _ = interval.tick() => {
+                log::info!("RUST: async task tick");
+
+                // call back all sync callbacks
+                for cb in &sync_cbs {
+                    cb();
+                }
+
+                // call back all async callbacks
+                for cb in &async_cbs {
+                    cb().await;
+                }
+
+                // send event on unbounded channel
+                sender.send(()).expect("handle receiver cannot be closed/dropped");
+            }
+        }
+    }
+
+    log::info!("RUST: async task stopped");
+}
+
+// #[gen_stub_pyclass]
+#[pyclass(name = "AsyncTaskHandle")]
+#[derive(Debug)]
+struct PyAsyncTaskHandle {
+    sender: Option<mpsc::UnboundedSender<AsyncTaskMessage>>,
+    receiver: mpsc::UnboundedReceiver<()>,
+}
+
+#[allow(clippy::expect_used)]
+impl PyAsyncTaskHandle {
+    const fn sender(&self) -> &mpsc::UnboundedSender<AsyncTaskMessage> {
+        self.sender
+            .as_ref()
+            .expect("The sender should only be None after de-initialization.")
+    }
+
+    const fn sender_mut(&mut self) -> &mpsc::UnboundedSender<AsyncTaskMessage> {
+        self.sender
+            .as_mut()
+            .expect("The sender should only be None after de-initialization.")
+    }
+
+    const fn new(
+        sender: mpsc::UnboundedSender<AsyncTaskMessage>,
+        receiver: mpsc::UnboundedReceiver<()>,
+    ) -> Self {
+        Self {
+            sender: Some(sender),
+            receiver,
+        }
+    }
+}
+
+// #[gen_stub_pymethods]
+#[pymethods]
+impl PyAsyncTaskHandle {
+    #[new]
+    fn py_new(py: Python<'_>) -> PyResult<Self> {
+        use pyo3_async_runtimes::tokio::get_runtime;
+
+        // create communication channel TOWARDS our task
+        let (h_sender, t_receiver) = mpsc::unbounded_channel::<AsyncTaskMessage>();
+
+        // create communication channel FROM our task
+        let (t_sender, h_receiver) = mpsc::unbounded_channel::<()>();
+
+        // perform necessary setup within tokio context - or it crashes
+        let () = get_runtime().block_on(async { needs_tokio_runtime() });
+
+        // spawn tokio task with this thread's task-locals - without this, async callbacks on the new threads will not work!!
+        _ = get_runtime().spawn_with_scope(py, async move {
+            async_task(t_sender, t_receiver).await;
+        });
+        Ok(Self::new(h_sender, h_receiver))
+    }
+
+    /// NOTE: exceptions in callbacks are silently ignored until end of execution
+    fn add_sync_callback(
+        &self,
+        // #[gen_stub(override_type(
+        //     type_repr="collections.abc.Callable[[], None]",
+        //     imports=("collections.abc")
+        // ))]
+        callback: Py<PyAny>,
+    ) -> PyResult<()> {
+        // blocking call to async method -> can do non-blocking if needed
+        self.sender()
+            .send(AsyncTaskMessage::SyncCallback(Box::new(move || {
+                _ = Python::with_gil(|py| callback.call0(py).write_unraisable_with(py));
+            })))
+            .pyerr()?;
+        Ok(())
+    }
+
+    /// NOTE: exceptions in callbacks are silently ignored until end of execution
+    fn add_async_callback(
+        &self,
+        // #[gen_stub(override_type(
+        //     type_repr="collections.abc.Callable[[], collections.abc.Awaitable[None]]",
+        //     imports=("collections.abc")
+        // ))]
+        callback: Py<PyAny>,
+    ) -> PyResult<()> {
+        // blocking call to async method -> can do non-blocking if needed
+        self.sender()
+            .send(AsyncTaskMessage::AsyncCallback(Box::new(move || {
+                let c = Python::with_gil(|py| callback.clone_ref(py));
+                async move {
+                    if let Some(f) = Python::with_gil(|py| {
+                        let coroutine = c.call0(py).write_unraisable_with(py)?;
+                        pyo3_async_runtimes::tokio::into_future(coroutine.into_bound(py))
+                            .write_unraisable_with(py)
+                    }) {
+                        _ = f.await.write_unraisable();
+                    }
+                }
+                .boxed()
+            })))
+            .pyerr()?;
+        Ok(())
+    }
+
+    async fn receive_unit(&mut self) -> PyResult<()> {
+        self.receiver
+            .recv()
+            .await
+            .ok_or(PyErr::new::<PyRuntimeError, _>(
+                "cannot receive unit on closed channel",
+            ))
+    }
+
+    fn drain_units(&mut self) -> PyResult<i32> {
+        let mut cnt = 0;
+        loop {
+            match self.receiver.try_recv() {
+                Err(TryRecvError::Disconnected) => {
+                    return Err(PyErr::new::<PyRuntimeError, _>(
+                        "cannot receive unit on closed channel",
+                    ));
+                }
+                Err(TryRecvError::Empty) => return Ok(cnt),
+                Ok(()) => {
+                    cnt += 1;
+                    continue;
+                }
+            }
+        }
+    }
+
+    // #[gen_stub(skip)]
+    const fn __traverse__(&self, _visit: PyVisit<'_>) -> Result<(), PyTraverseError> {
+        Ok(()) // This is needed purely so `__clear__` can work
+    }
+
+    // #[gen_stub(skip)]
+    fn __clear__(&mut self) {
+        // TODO: may or may not need to await a "kill-signal" oneshot channel message,
+        //       to ensure that the networking task is done BEFORE exiting the clear function...
+        //       but this may require GIL?? and it may not be safe to call GIL here??
+        self.sender = None; // Using Option<T> as a trick to force `sender` channel to be dropped
+    }
+}
+
+pub fn examples_submodule(m: &Bound<'_, PyModule>) -> PyResult<()> {
+    m.add_class::<PyAsyncTaskHandle>()?;
+
+    Ok(())
+}
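
The task pattern in `examples/mod.rs` (register callbacks from an inbox, invoke them on every tick, stop when the channel closes) can be sketched in plain `asyncio` for readers who think in Python first. This is only an analogy under stated assumptions: `ticking_task`, `_CLOSED`, and the two queues are hypothetical names, and the sentinel stands in for the Rust side dropping its `sender` in `__clear__`.

```python
import asyncio

_CLOSED = object()  # hypothetical sentinel standing in for "all senders dropped"


async def ticking_task(inbox: asyncio.Queue, ticks: asyncio.Queue, period: float) -> None:
    """Register callbacks from `inbox`, invoke them on every tick, stop on close."""
    callbacks = []
    loop = asyncio.get_running_loop()
    next_tick = loop.time() + period
    while True:
        timeout = max(0.0, next_tick - loop.time())
        try:
            # wait for the next message, but no longer than until the next tick
            message = await asyncio.wait_for(inbox.get(), timeout)
        except asyncio.TimeoutError:
            # tick: run every registered callback, then report the tick
            for callback in callbacks:
                callback()
            ticks.put_nowait(None)
            next_tick += period
            continue
        if message is _CLOSED:
            break  # channel closed: leave the loop so the task can finish
        callbacks.append(message)


async def demo() -> tuple:
    inbox, ticks = asyncio.Queue(), asyncio.Queue()
    calls = []
    task = asyncio.create_task(ticking_task(inbox, ticks, period=0.01))
    inbox.put_nowait(lambda: calls.append("called"))
    await ticks.get()            # wait for the first tick to be reported
    inbox.put_nowait(_CLOSED)    # simulate the handle being cleared
    await task
    return len(calls) >= 1, task.done()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design point carried over from the Rust code is that shutdown is signalled by closing the channel rather than by cancelling the task, so the loop always exits at a well-defined point.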
--- a/rust/exo_pyo3_bindings/src/lib.rs
+++ b/rust/exo_pyo3_bindings/src/lib.rs
@@ -1,42 +1,216 @@
 //! TODO: crate documentation
-pub(crate) mod allow_threading;
+//!
+//! This is placeholder documentation.
+//!
+//!

+// enable Rust-unstable features for convenience
+#![feature(trait_alias)]
+#![feature(tuple_trait)]
+#![feature(unboxed_closures)]
+// #![feature(stmt_expr_attributes)]
+// #![feature(assert_matches)]
+// #![feature(async_fn_in_dyn_trait)]
+// #![feature(async_for_loop)]
+// #![feature(auto_traits)]
+// #![feature(negative_impls)]
+
+extern crate core;
+mod allow_threading;
+mod examples;
 pub(crate) mod networking;
-pub(crate) mod take_once {
-    use std::sync::Mutex;
+pub(crate) mod pylibp2p;

-    pub struct TakeOnce<T>(Mutex<Option<T>>);
-    impl<T> TakeOnce<T> {
-        pub fn new(t: T) -> Self {
-            Self(Mutex::new(Some(t)))
+use crate::networking::networking_submodule;
+use crate::pylibp2p::ident::ident_submodule;
+use crate::pylibp2p::multiaddr::multiaddr_submodule;
+use pyo3::prelude::PyModule;
+use pyo3::prelude::*;
+use pyo3::{Bound, PyResult, pyclass, pymodule};
+use pyo3_stub_gen::define_stub_info_gatherer;
+
+/// Namespace for all the constants used by this crate.
+pub(crate) mod r#const {
+    pub const MPSC_CHANNEL_SIZE: usize = 1024;
+}
+
+/// Namespace for all the type/trait aliases used by this crate.
+pub(crate) mod alias {
+    use std::error::Error;
+    use std::marker::Tuple;
+
+    pub trait SendFn<Args: Tuple + Send + 'static, Output> =
+        Fn<Args, Output = Output> + Send + 'static;
+
+    pub type AnyError = Box<dyn Error + Send + Sync + 'static>;
+    pub type AnyResult<T> = Result<T, AnyError>;
+}
+
+/// Namespace for crate-wide extension traits/methods
+pub(crate) mod ext {
+    use crate::allow_threading::AllowThreads;
+    use extend::ext;
+    use pyo3::exceptions::{PyConnectionError, PyRuntimeError};
+    use pyo3::marker::Ungil;
+    use pyo3::types::PyBytes;
+    use pyo3::{Py, PyErr, PyResult, Python};
+    use tokio::runtime::Runtime;
+    use tokio::sync::mpsc;
+    use tokio::sync::mpsc::error::TryRecvError;
+    use tokio::task::JoinHandle;
+
+    #[ext(pub, name = ByteArrayExt)]
+    impl [u8] {
+        fn pybytes(&self) -> Py<PyBytes> {
+            Python::with_gil(|py| PyBytes::new(py, self).unbind())
        }
-        pub fn take(&self) -> Option<T> {
-            match self.0.try_lock() {
-                Ok(mut o) => o.take(),
-                Err(_) => None,
+    }
+
+    #[ext(pub, name = ResultExt)]
+    impl<T, E> Result<T, E>
+    where
+        E: ToString,
+    {
+        fn pyerr(self) -> PyResult<T> {
+            self.map_err(|e| PyRuntimeError::new_err(e.to_string()))
+        }
+    }
+
+    pub trait FutureExt: Future + Sized {
+        /// SEE: https://pyo3.rs/v0.26.0/async-await.html#detaching-from-the-interpreter-across-await
+        fn allow_threads_py(self) -> AllowThreads<Self>
+        where
+            AllowThreads<Self>: Future,
+        {
+            AllowThreads::new(self)
+        }
+    }
+
+    impl<T: Future> FutureExt for T {}
+
+    #[ext(pub, name = PyErrExt)]
+    impl PyErr {
+        fn receiver_channel_closed() -> Self {
+            PyConnectionError::new_err("Receiver channel closed unexpectedly")
+        }
+    }
+
+    #[ext(pub, name = PyResultExt)]
+    impl<T> PyResult<T> {
+        fn write_unraisable(self) -> Option<T> {
+            Python::with_gil(|py| self.write_unraisable_with(py))
+        }
+
+        fn write_unraisable_with(self, py: Python<'_>) -> Option<T> {
+            match self {
+                Ok(v) => Some(v),
+                Err(e) => {
+                    // write error back to python
+                    e.write_unraisable(py, None);
+                    None
+                }
+            }
+        }
+    }
+
+    #[ext(pub, name = TokioRuntimeExt)]
+    impl Runtime {
+        fn spawn_with_scope<F>(&self, py: Python<'_>, future: F) -> PyResult<JoinHandle<F::Output>>
+        where
+            F: Future + Send + 'static,
+            F::Output: Send + 'static,
+        {
+            let locals = pyo3_async_runtimes::tokio::get_current_locals(py)?;
+            Ok(self.spawn(pyo3_async_runtimes::tokio::scope(locals, future)))
+        }
+    }
+
+    #[ext(pub, name = TokioMpscSenderExt)]
+    impl<T> mpsc::Sender<T> {
+        /// Sends a value, waiting until there is capacity.
+        ///
+        /// A successful send occurs when it is determined that the other end of the
+        /// channel has not hung up already. An unsuccessful send would be one where
+        /// the corresponding receiver has already been closed.
+        async fn send_py(&self, value: T) -> PyResult<()> {
+            self.send(value)
+                .await
+                .map_err(|_| PyErr::receiver_channel_closed())
+        }
+    }
+
+    #[ext(pub, name = TokioMpscReceiverExt)]
+    impl<T> mpsc::Receiver<T> {
+        /// Receives the next value for this receiver.
+        async fn recv_py(&mut self) -> PyResult<T> {
+            self.recv().await.ok_or_else(PyErr::receiver_channel_closed)
+        }
+
+        /// Receives at most `limit` values for this receiver and returns them.
+        ///
+        /// For `limit = 0`, an empty collection of messages will be returned immediately.
+        /// For `limit > 0`, if there are no messages in the channel's queue this method
+        /// will sleep until a message is sent.
+        async fn recv_many_py(&mut self, limit: usize) -> PyResult<Vec<T>> {
+            // get updates from receiver channel
+            let mut updates = Vec::with_capacity(limit);
+            let received = self.recv_many(&mut updates, limit).await;
+
+            // if we received zero items, then the channel was unexpectedly closed
+            if limit != 0 && received == 0 {
+                return Err(PyErr::receiver_channel_closed());
+            }
+
+            Ok(updates)
+        }
+
+        /// Tries to receive the next value for this receiver.
+        fn try_recv_py(&mut self) -> PyResult<Option<T>> {
+            match self.try_recv() {
+                Ok(v) => Ok(Some(v)),
+                Err(TryRecvError::Empty) => Ok(None),
+                Err(TryRecvError::Disconnected) => Err(PyErr::receiver_channel_closed()),
            }
        }
    }
 }

-use pyo3::prelude::*;
+pub(crate) mod private {
+    use std::marker::Sized;

-use pyo3_stub_gen::define_stub_info_gatherer;
+    /// Sealed traits support
+    pub trait Sealed {}
+    impl<T: ?Sized> Sealed for T {}
+}
+
+/// A wrapper around [`Py`] that implements [`Clone`] using [`Python::with_gil`].
+#[repr(transparent)]
+pub(crate) struct ClonePy<T>(pub Py<T>);
+
+impl<T> Clone for ClonePy<T> {
+    fn clone(&self) -> Self {
+        Python::with_gil(|py| Self(self.0.clone_ref(py)))
+    }
+}

 /// A Python module implemented in Rust. The name of this function must match
 /// the `lib.name` setting in the `Cargo.toml`, else Python will not be able to
 /// import the module.
 #[pymodule(name = "exo_pyo3_bindings")]
-pub fn networking_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
+fn main_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
    // install logger
    pyo3_log::init();
-    // setup runtime
-    let mut builder = tokio::runtime::Builder::new_multi_thread();
-    builder.enable_all();
-    pyo3_async_runtimes::tokio::init(builder);

-    m.add_class::<networking::PyPeer>()?;
-    m.add_class::<networking::PyKeypair>()?;
+    // TODO: for now this is all NOT a submodule, but figure out how to make the submodule system
+    //       work with maturin, where the types generate correctly, in the right folder, without
+    //       too many importing issues...
+    ident_submodule(m)?;
+    multiaddr_submodule(m)?;
+    networking_submodule(m)?;
+
+    // top-level constructs
+    // TODO: ...
+
    Ok(())
 }
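
The `write_unraisable` helpers above rely on pyo3's `PyErr::write_unraisable`, which reports the error through Python's `sys.unraisablehook` instead of raising it. A minimal sketch of how Python code could observe such errors follows; it does not call the bindings at all. The `_RaisesOnCleanup` class and `collect_unraisable` function are hypothetical, and a raising finalizer is used only because it is a pure-Python way to produce an unraisable error on CPython.

```python
import sys


class _RaisesOnCleanup:
    """Hypothetical helper: its finalizer raises, so CPython reports the error as unraisable."""

    def __del__(self) -> None:
        raise ValueError("error nobody can catch")


def collect_unraisable() -> list:
    """Temporarily replace `sys.unraisablehook` and record what it receives."""
    seen = []
    previous_hook = sys.unraisablehook
    sys.unraisablehook = lambda unraisable: seen.append(repr(unraisable.exc_value))
    try:
        obj = _RaisesOnCleanup()
        del obj  # on CPython the finalizer runs here; the ValueError goes to the hook
    finally:
        sys.unraisablehook = previous_hook
    return seen


if __name__ == "__main__":
    print(collect_unraisable())
```

Installing a hook like this in the host application is one way to turn callback failures into log entries or test failures rather than stderr output.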

--- a/rust/exo_pyo3_bindings/src/networking.rs
+++ b/rust/exo_pyo3_bindings/src/networking.rs
@@ -1,214 +1,572 @@
-use crate::allow_threading::AllowThreads;
-use crate::take_once::TakeOnce;
+#![allow(
+    clippy::multiple_inherent_impl,
+    clippy::unnecessary_wraps,
+    clippy::unused_self,
+    clippy::needless_pass_by_value
+)]

-use std::pin::pin;
+use crate::r#const::MPSC_CHANNEL_SIZE;
+use crate::ext::{ByteArrayExt as _, FutureExt, PyErrExt as _};
+use crate::ext::{ResultExt as _, TokioMpscReceiverExt as _, TokioMpscSenderExt as _};
+use crate::pyclass;
+use crate::pylibp2p::ident::{PyKeypair, PyPeerId};
+use libp2p::futures::StreamExt as _;
+use libp2p::gossipsub::{IdentTopic, Message, MessageId, PublishError};
+use libp2p::swarm::SwarmEvent;
+use libp2p::{gossipsub, mdns};
+use networking::discovery;
+use networking::swarm::create_swarm;
+use pyo3::prelude::{PyModule, PyModuleMethods as _};
+use pyo3::types::PyBytes;
+use pyo3::{Bound, Py, PyErr, PyResult, PyTraverseError, PyVisit, Python, pymethods};
+use pyo3_stub_gen::derive::{gen_stub_pyclass, gen_stub_pyclass_enum, gen_stub_pymethods};
+use std::net::IpAddr;
+use tokio::sync::{Mutex, mpsc, oneshot};

-use futures_lite::FutureExt;
-use libp2p::{gossipsub::PublishError, identity::Keypair};
-use networking::{FromSwarm, Peer, ToSwarm};
-use pyo3::{
-    coroutine::CancelHandle,
-    exceptions::{PyConnectionError, PyRuntimeError, PyValueError},
-    prelude::*,
-    types::PyBytes,
-};
-use pyo3_stub_gen::{
-    derive::{gen_methods_from_python, gen_stub_pyclass, gen_stub_pymethods},
-    inventory::submit,
-};
-use tokio::sync::{Mutex, mpsc};
+mod exception {
+    use pyo3::types::PyTuple;
+    use pyo3::{PyErrArguments, exceptions::PyException, prelude::*};
+    use pyo3_stub_gen::derive::*;

-#[gen_stub_pyclass]
-#[pyclass(name = "Keypair", frozen)]
-#[derive(Clone)]
-pub struct PyKeypair(Keypair);
+    #[gen_stub_pyclass]
+    #[pyclass(frozen, extends=PyException, name="NoPeersSubscribedToTopicError")]
+    pub struct PyNoPeersSubscribedToTopicError {}

-#[gen_stub_pymethods]
-#[pymethods]
-impl PyKeypair {
-    /// Generate a new ed25519 keypair
-    #[staticmethod]
-    fn generate() -> Self {
-        Self(Keypair::generate_ed25519())
-    }
+    impl PyNoPeersSubscribedToTopicError {
+        const MSG: &'static str = "\
+        No peers are currently subscribed to receive messages on this topic. \
+        Wait for peers to subscribe or check your network connectivity.";

-    /// Decode a private key from a protobuf structure and parse it as a `Keypair`.
-    #[staticmethod]
-    fn from_protobuf_encoding(bytes: &Bound<'_, PyBytes>) -> Self {
-        let bytes = Vec::from(bytes.as_bytes());
-        Self(Keypair::from_protobuf_encoding(&bytes).expect("todo"))
-    }
-
-    /// Encode a private key to a protobuf structure.
-    fn to_protobuf_encoding<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyBytes>> {
-        match self.0.to_protobuf_encoding() {
-            Ok(bytes) => Ok(PyBytes::new(py, &bytes)),
-            Err(e) => Err(PyValueError::new_err(e.to_string())),
+        /// Creates a new [`PyErr`] of this type.
+        ///
+        /// [`PyErr`]: https://docs.rs/pyo3/latest/pyo3/struct.PyErr.html "PyErr in pyo3"
+        pub(crate) fn new_err() -> PyErr {
+            PyErr::new::<Self, _>(()) // TODO: check if this needs to be replaced???
        }
    }

-    fn to_string(&self) -> String {
-        self.0.public().to_peer_id().to_base58()
+    #[gen_stub_pymethods]
+    #[pymethods]
+    impl PyNoPeersSubscribedToTopicError {
+        #[new]
+        #[pyo3(signature = (*args))]
+        #[allow(unused_variables)]
+        pub(crate) fn new(args: &Bound<'_, PyTuple>) -> Self {
+            Self {}
+        }
+
+        fn __repr__(&self) -> String {
+            format!("NoPeersSubscribedToTopicError(\"{}\")", Self::MSG)
+        }
+
+        fn __str__(&self) -> String {
+            Self::MSG.to_string()
+        }
+    }
+
+    #[gen_stub_pyclass]
+    #[pyclass(frozen, extends=PyException, name="AllQueuesFullError")]
+    pub struct PyAllQueuesFullError {}
+
+    impl PyAllQueuesFullError {
+        const MSG: &'static str =
+            "All libp2p peers are unresponsive, resend the message or reconnect.";
+
+        /// Creates a new [`PyErr`] of this type.
+        ///
+        /// [`PyErr`]: https://docs.rs/pyo3/latest/pyo3/struct.PyErr.html "PyErr in pyo3"
+        pub(crate) fn new_err() -> PyErr {
+            PyErr::new::<Self, _>(()) // TODO: check if this needs to be replaced???
+        }
+    }
+
+    #[gen_stub_pymethods]
+    #[pymethods]
+    impl PyAllQueuesFullError {
+        #[new]
+        #[pyo3(signature = (*args))]
+        #[allow(unused_variables)]
+        pub(crate) fn new(args: &Bound<'_, PyTuple>) -> Self {
+            Self {}
+        }
+
+        fn __repr__(&self) -> String {
+            format!("AllQueuesFullError(\"{}\")", Self::MSG)
+        }
+
+        fn __str__(&self) -> String {
+            Self::MSG.to_string()
+        }
    }
 }

-struct PeerBuilder(
-    String,
-    Keypair,
-    mpsc::Sender<FromSwarm>,
-    mpsc::Receiver<ToSwarm>,
-);
+/// Connection or disconnection event discriminant type.
+#[gen_stub_pyclass_enum]
+#[pyclass(eq, eq_int, name = "ConnectionUpdateType")]
+#[derive(Debug, Clone, PartialEq)]
+enum PyConnectionUpdateType {
+    Connected = 0,
+    Disconnected,
+}

 #[gen_stub_pyclass]
-#[pyclass]
-pub struct PyPeer {
-    peer: TakeOnce<PeerBuilder>,
-    to_swarm: mpsc::Sender<ToSwarm>,
-    from_swarm: Mutex<mpsc::Receiver<FromSwarm>>,
+#[pyclass(frozen, name = "ConnectionUpdate")]
+#[derive(Debug, Clone)]
+struct PyConnectionUpdate {
+    /// Whether this is a connection or disconnection event
+    #[pyo3(get)]
+    update_type: PyConnectionUpdateType,
+
+    /// Identity of the peer that we have connected to or disconnected from.
+    #[pyo3(get)]
+    peer_id: PyPeerId,
+
+    /// Remote connection's IPv4 address.
+    #[pyo3(get)]
+    remote_ipv4: String,
+
+    /// Remote connection's TCP port.
+    #[pyo3(get)]
+    remote_tcp_port: u16,
+}
+
+enum ToTask {
+    GossipsubSubscribe {
+        topic: String,
+        result_tx: oneshot::Sender<PyResult<bool>>,
+    },
+    GossipsubUnsubscribe {
+        topic: String,
+        result_tx: oneshot::Sender<bool>,
+    },
+    GossipsubPublish {
+        topic: String,
+        data: Vec<u8>,
+        result_tx: oneshot::Sender<PyResult<MessageId>>,
+    },
+}
+
+#[allow(clippy::enum_glob_use)]
+async fn networking_task(
+    mut swarm: networking::swarm::Swarm,
+    mut to_task_rx: mpsc::Receiver<ToTask>,
+    connection_update_tx: mpsc::Sender<PyConnectionUpdate>,
+    gossipsub_message_tx: mpsc::Sender<(String, Vec<u8>)>,
+) {
+    use SwarmEvent::*;
+    use ToTask::*;
+    use mdns::Event::*;
+    use networking::swarm::BehaviourEvent::*;
+
+    log::info!("RUST: networking task started");
+
+    loop {
+        tokio::select! {
+            message = to_task_rx.recv() => {
+                // handle closed channel
+                let Some(message) = message else {
+                    log::info!("RUST: channel closed");
+                    break;
+                };
+
+                // dispatch incoming messages
+                match message {
+                    GossipsubSubscribe { topic, result_tx } => {
+                        // try to subscribe
+                        let result = swarm.behaviour_mut()
+                            .gossipsub.subscribe(&IdentTopic::new(topic));
+
+                        // send response oneshot
+                        if let Err(e) = result_tx.send(result.pyerr()) {
+                            log::error!("RUST: could not subscribe to gossipsub topic since channel already closed: {e:?}");
+                            continue;
+                        }
+                    }
+                    GossipsubUnsubscribe { topic, result_tx } => {
+                        // try to unsubscribe from the topic
+                        let result = swarm.behaviour_mut()
+                            .gossipsub.unsubscribe(&IdentTopic::new(topic));
+
+                        // send response oneshot (log and continue if the channel is closed)
+                        if let Err(e) = result_tx.send(result) {
+                            log::error!("RUST: could not unsubscribe from gossipsub topic since channel already closed: {e:?}");
+                            continue;
+                        }
+                    }
+                    GossipsubPublish { topic, data, result_tx } => {
+                        // try to publish the data -> catch NoPeersSubscribedToTopic error & convert to correct exception
+                        let result = swarm.behaviour_mut().gossipsub.publish(
+                            IdentTopic::new(topic), data);
+                        let pyresult: PyResult<MessageId> = if let Err(PublishError::NoPeersSubscribedToTopic) = result {
+                            Err(exception::PyNoPeersSubscribedToTopicError::new_err())
+                        } else if let Err(PublishError::AllQueuesFull(_)) = result {
+                            Err(exception::PyAllQueuesFullError::new_err())
+                        } else {
+                            result.pyerr()
+                        };
+
+                        // send response oneshot (log and continue if the channel is closed)
+                        if let Err(e) = result_tx.send(pyresult) {
+                            log::error!("RUST: could not publish gossipsub message since channel already closed: {e:?}");
+                            continue;
+                        }
+                    }
+                }
+            }
+
+            // architectural solution to this problem:
+            // create a keep_alive behaviour whose job is to dial peers discovered by mDNS (and drop them when expired)
+            //   -> it will emit TRUE connected/disconnected events consumable elsewhere
+            //
+            // gossipsub will feed off dial attempts created by networking, and that will bootstrap its peers list
+            // then for actual communication it will dial those peers if need be
+            swarm_event = swarm.select_next_some() => {
+                match swarm_event {
+                    Behaviour(Gossipsub(gossipsub::Event::Message {
+                        message: Message {
+                            topic,
+                            data,
+                            ..
+                        },
+                        ..
+                    })) => {
+                        // topic-ID is just the topic hash!!! (since we used identity hasher)
+                        let message = (topic.into_string(), data);
+
+                        // send incoming message to channel (log and continue if the channel is closed)
+                        if let Err(e) = gossipsub_message_tx.send(message).await {
+                            log::error!("RUST: could not send incoming gossipsub message since channel already closed: {e}");
+                            continue;
+                        }
+                    },
+                    Behaviour(Discovery(discovery::Event::ConnectionEstablished { peer_id, remote_ip, remote_tcp_port, .. })) => {
+                        // grab IPv4 string
+                        let remote_ipv4 = match remote_ip {
+                            IpAddr::V4(ip) => ip.to_string(),
+                            IpAddr::V6(ip) => {
+                                log::warn!("RUST: ignoring connection to IPv6 address: {ip}");
+                                continue;
+                            }
+                        };
+
+                        // send connection event to channel (or log an error and skip it if the channel is closed)
+                        if let Err(e) = connection_update_tx.send(PyConnectionUpdate {
+                            update_type: PyConnectionUpdateType::Connected,
+                            peer_id: PyPeerId(peer_id),
+                            remote_ipv4,
+                            remote_tcp_port,
+                        }).await {
+                            log::error!("RUST: could not send connection update since channel already closed: {e}");
+                            continue;
+                        }
+                    },
+                    Behaviour(Discovery(discovery::Event::ConnectionClosed { peer_id, remote_ip, remote_tcp_port, .. })) => {
+                        // grab IPv4 string
+                        let remote_ipv4 = match remote_ip {
+                            IpAddr::V4(ip) => ip.to_string(),
+                            IpAddr::V6(ip) => {
+                                log::warn!("RUST: ignoring disconnection from IPv6 address: {ip}");
+                                continue;
+                            }
+                        };
+
+                        // send disconnection event to channel (or log an error and skip it if the channel is closed)
+                        if let Err(e) = connection_update_tx.send(PyConnectionUpdate {
+                            update_type: PyConnectionUpdateType::Disconnected,
+                            peer_id: PyPeerId(peer_id),
+                            remote_ipv4,
+                            remote_tcp_port,
+                        }).await {
+                            log::error!("RUST: could not send connection update since channel already closed: {e}");
+                            continue;
+                        }
+                    },
+                    e => {
+                        log::info!("RUST: other event {e:?}");
+                    }
+                }
+            }
+        }
+    }
+
+    log::info!("RUST: networking task stopped");
+}
+
+#[gen_stub_pyclass]
+#[pyclass(name = "NetworkingHandle")]
+#[derive(Debug)]
+struct PyNetworkingHandle {
+    // channels
+    to_task_tx: Option<mpsc::Sender<ToTask>>,
+    connection_update_rx: Mutex<mpsc::Receiver<PyConnectionUpdate>>,
+    gossipsub_message_rx: Mutex<mpsc::Receiver<(String, Vec<u8>)>>,
+}
+
+impl Drop for PyNetworkingHandle {
+    fn drop(&mut self) {
+        // TODO: may or may not need to await a "kill-signal" oneshot channel message,
+        //       to ensure that the networking task is done BEFORE `drop` returns...
+        //       but this may require GIL?? and it may not be safe to call GIL here??
+        self.to_task_tx = None; // Using Option<T> as a trick to force channel to be dropped
+    }
+}
+
+#[allow(clippy::expect_used)]
+impl PyNetworkingHandle {
+    fn new(
+        to_task_tx: mpsc::Sender<ToTask>,
+        connection_update_rx: mpsc::Receiver<PyConnectionUpdate>,
+        gossipsub_message_rx: mpsc::Receiver<(String, Vec<u8>)>,
+    ) -> Self {
+        Self {
+            to_task_tx: Some(to_task_tx),
+            connection_update_rx: Mutex::new(connection_update_rx),
+            gossipsub_message_rx: Mutex::new(gossipsub_message_rx),
+        }
+    }
+
+    const fn to_task_tx(&self) -> &mpsc::Sender<ToTask> {
+        self.to_task_tx
+            .as_ref()
+            .expect("The sender should only be None after de-initialization.")
+    }
 }

 #[gen_stub_pymethods]
 #[pymethods]
-impl PyPeer {
-    #[staticmethod]
-    fn new(kp: PyKeypair, namespace: String) -> PyResult<Self> {
-        let (to_client, from_swarm) = mpsc::channel(1024);
-        let (to_swarm, from_client) = mpsc::channel(1024);
-        Ok(Self {
-            peer: TakeOnce::new(PeerBuilder(namespace, kp.0, to_client, from_client)),
-            to_swarm,
-            from_swarm: Mutex::new(from_swarm),
-        })
+impl PyNetworkingHandle {
+    // NOTE: `async fn`s here that use `.await` will wrap the future in `.allow_threads_py()`
+    //       immediately beforehand to release the interpreter.
+    //       SEE: https://pyo3.rs/v0.26.0/async-await.html#detaching-from-the-interpreter-across-await
+
+    // ---- Lifecycle management methods ----
+
+    #[new]
+    fn py_new(identity: Bound<'_, PyKeypair>) -> PyResult<Self> {
+        use pyo3_async_runtimes::tokio::get_runtime;
+
+        // create communication channels
+        let (to_task_tx, to_task_rx) = mpsc::channel(MPSC_CHANNEL_SIZE);
+        let (connection_update_tx, connection_update_rx) = mpsc::channel(MPSC_CHANNEL_SIZE);
+        let (gossipsub_message_tx, gossipsub_message_rx) = mpsc::channel(MPSC_CHANNEL_SIZE);
+
+        // get identity
+        let identity = identity.borrow().0.clone();
+
+        // create networking swarm (within tokio context!! or it crashes)
+        let swarm = get_runtime()
+            .block_on(async { create_swarm(identity) })
+            .pyerr()?;
+
+        // spawn tokio task running the networking logic
+        get_runtime().spawn(async move {
+            networking_task(
+                swarm,
+                to_task_rx,
+                connection_update_tx,
+                gossipsub_message_tx,
+            )
+            .await;
+        });
+        Ok(Self::new(
+            to_task_tx,
+            connection_update_rx,
+            gossipsub_message_rx,
+        ))
    }

    #[gen_stub(skip)]
-    async fn run(&self, #[pyo3(cancel_handle)] mut cancel: CancelHandle) -> PyResult<()> {
-        let builder = self
-            .peer
-            .take()
-            .ok_or_else(|| PyRuntimeError::new_err("tried to run peer twice"))?;
-        let jh = pyo3_async_runtimes::tokio::get_runtime()
-            .spawn(async move {
-                let mut peer =
-                    Peer::new(builder.0, builder.1, builder.2, builder.3).map_err(|_| {
-                        PyConnectionError::new_err("peer failed to listen on default address")
-                    })?;
-                peer.run()
-                    .await
-                    .map_err(|()| PyConnectionError::new_err("peer communication closed"))
+    const fn __traverse__(&self, _visit: PyVisit<'_>) -> Result<(), PyTraverseError> {
+        Ok(()) // This is needed purely so `__clear__` can work
+    }
+
+    #[gen_stub(skip)]
+    fn __clear__(&mut self) {
+        // TODO: may or may not need to await a "kill-signal" oneshot channel message,
+        //       to ensure that the networking task is done BEFORE exiting the clear function...
+        //       but this may require GIL?? and it may not be safe to call GIL here??
+        self.to_task_tx = None; // Using Option<T> as a trick to force channel to be dropped
+    }
+
+    // ---- Connection update receiver methods ----
+
+    /// Receives the next `ConnectionUpdate` from networking.
+    async fn connection_update_recv(&self) -> PyResult<PyConnectionUpdate> {
+        self.connection_update_rx
+            .lock()
+            .allow_threads_py() // allow-threads-aware async call
+            .await
+            .recv_py()
+            .allow_threads_py() // allow-threads-aware async call
+            .await
+    }
+
+    /// Receives at most `limit` `ConnectionUpdate`s from networking and returns them.
+    ///
+    /// For `limit = 0`, an empty collection of `ConnectionUpdate`s will be returned immediately.
+    /// For `limit > 0`, if there are no `ConnectionUpdate`s in the channel's queue this method
+    /// will sleep until a `ConnectionUpdate` is sent.
+    async fn connection_update_recv_many(&self, limit: usize) -> PyResult<Vec<PyConnectionUpdate>> {
+        self.connection_update_rx
+            .lock()
+            .allow_threads_py() // allow-threads-aware async call
+            .await
+            .recv_many_py(limit)
+            .allow_threads_py() // allow-threads-aware async call
+            .await
+    }
+
+    // TODO: rn this blocks main thread if anything else is awaiting the channel (bc it's a mutex)
+    //       so it's too dangerous to expose just yet. figure out better semantics for handling this,
+    //       so things don't randomly block
+    // /// Tries to receive the next `ConnectionUpdate` from networking.
+    // fn connection_update_try_recv(&self) -> PyResult<Option<PyConnectionUpdate>> {
+    //     self.connection_update_rx.blocking_lock().try_recv_py()
+    // }
+    //
+    // /// Checks if the `ConnectionUpdate` channel is empty.
+    // fn connection_update_is_empty(&self) -> bool {
+    //     self.connection_update_rx.blocking_lock().is_empty()
+    // }
+    //
+    // /// Returns the number of `ConnectionUpdate`s in the channel.
+    // fn connection_update_len(&self) -> usize {
+    //     self.connection_update_rx.blocking_lock().len()
+    // }
+
+    // ---- Gossipsub management methods ----
+
+    /// Subscribe to a `GossipSub` topic.
+    ///
+    /// Returns `True` if the subscription worked. Returns `False` if we were already subscribed.
+    async fn gossipsub_subscribe(&self, topic: String) -> PyResult<bool> {
+        let (tx, rx) = oneshot::channel();
+
+        // send off request to subscribe
+        self.to_task_tx()
+            .send_py(ToTask::GossipsubSubscribe {
+                topic,
+                result_tx: tx,
            })
-            .or(async {
-                cancel.cancelled().await;
-                Ok(Ok(()))
-            });
-        match AllowThreads(pin!(jh)).await {
-            Err(e) if e.is_cancelled() => Ok(()),
-            Err(e) if e.is_panic() => Err(PyRuntimeError::new_err(format!("tokio panic {e}"))),
-            Err(_) => unreachable!(),
-            Ok(res) => res,
-        }
+            .allow_threads_py() // allow-threads-aware async call
+            .await?;
+
+        // wait for response & return any errors
+        rx.allow_threads_py() // allow-threads-aware async call
+            .await
+            .map_err(|_| PyErr::receiver_channel_closed())?
    }

-    async fn subscribe(&self, topic: String) -> PyResult<()> {
-        self.to_swarm
-            .send(ToSwarm::Subscribe(topic))
+    /// Unsubscribes from a `GossipSub` topic.
+    ///
+    /// Returns `True` if we were subscribed to this topic. Returns `False` if we were not subscribed.
+    async fn gossipsub_unsubscribe(&self, topic: String) -> PyResult<bool> {
+        let (tx, rx) = oneshot::channel();
+
+        // send off request to unsubscribe
+        self.to_task_tx()
+            .send_py(ToTask::GossipsubUnsubscribe {
+                topic,
+                result_tx: tx,
+            })
+            .allow_threads_py() // allow-threads-aware async call
+            .await?;
+
+        // wait for response & convert any errors
+        rx.allow_threads_py() // allow-threads-aware async call
            .await
-            .map_err(|_| PyRuntimeError::new_err("swarm communication closed"))
-    }
-    async fn unsubscribe(&self, topic: String) -> PyResult<()> {
-        self.to_swarm
-            .send(ToSwarm::Unsubscribe(topic))
-            .await
-            .map_err(|_| PyRuntimeError::new_err("swarm communication closed"))
-    }
-    async fn send(&self, topic: String, payload: Py<PyBytes>) -> PyResult<()> {
-        // this function attaches to the python interpreter synchronously to avoid holding the GIL
-        let bytes = Python::attach(|py| Vec::from(payload.bind(py).as_bytes()));
-        self.to_swarm
-            .send(ToSwarm::Message(topic, bytes))
-            .await
-            .map_err(|_| PyRuntimeError::new_err("swarm communication closed"))
+            .map_err(|_| PyErr::receiver_channel_closed())
    }

-    #[gen_stub(skip)]
-    async fn recv(
-        &self,
-        #[pyo3(cancel_handle)] mut cancel: CancelHandle,
-    ) -> PyResult<PySwarmEvent> {
-        loop {
-            return match AllowThreads(pin!(
-                self.from_swarm
-                    .try_lock()
-                    .map_err(|_| PyRuntimeError::new_err("tried to recv twice"))?
-                    .recv()
-                    .or(async {
-                        cancel.cancelled().await;
-                        None
-                    })
-            ))
+    /// Publishes a message on the given topic to the `GossipSub` network.
+    ///
+    /// If no peers are subscribed to this topic, raises a `NoPeersSubscribedToTopicError` exception.
+    async fn gossipsub_publish(&self, topic: String, data: Py<PyBytes>) -> PyResult<()> {
+        let (tx, rx) = oneshot::channel();
+
+        // copy the payload out of Python, then send off request to publish
+        let data = Python::with_gil(|py| Vec::from(data.as_bytes(py)));
+        self.to_task_tx()
+            .send_py(ToTask::GossipsubPublish {
+                topic,
+                data,
+                result_tx: tx,
+            })
+            .allow_threads_py() // allow-threads-aware async call
+            .await?;
+
+        // wait for response & return any errors => ignore messageID for now!!!
+        let _ = rx
+            .allow_threads_py() // allow-threads-aware async call
            .await
-            {
-                Some(FromSwarm::PublishError(p)) => match p {
-                    PublishError::AllQueuesFull(_) => {
-                        Err(PyConnectionError::new_err("swarm overloaded"))
-                    }
-                    PublishError::MessageTooLarge => {
-                        Err(PyValueError::new_err("message too large"))
-                    }
-                    PublishError::NoPeersSubscribedToTopic => {
-                        continue;
-                    }
-                    // TODO(evan): logs here
-                    _ => continue,
-                },
-                None => Err(PyRuntimeError::new_err("swarm communication closed")),
-                Some(fs) => Ok(PySwarmEvent(fs)),
-            };
-        }
+            .map_err(|_| PyErr::receiver_channel_closed())??;
+        Ok(())
    }
+
+    // ---- Gossipsub message receiver methods ----
+
+    /// Receives the next message from the `GossipSub` network.
+    async fn gossipsub_recv(&self) -> PyResult<(String, Py<PyBytes>)> {
+        self.gossipsub_message_rx
+            .lock()
+            .allow_threads_py() // allow-threads-aware async call
+            .await
+            .recv_py()
+            .allow_threads_py() // allow-threads-aware async call
+            .await
+            .map(|(t, d)| (t, d.pybytes()))
+    }
+
+    /// Receives at most `limit` messages from the `GossipSub` network and returns them.
+    ///
+    /// For `limit = 0`, an empty collection of messages will be returned immediately.
+    /// For `limit > 0`, if there are no messages in the channel's queue this method
+    /// will sleep until a message is sent.
+    async fn gossipsub_recv_many(&self, limit: usize) -> PyResult<Vec<(String, Py<PyBytes>)>> {
+        Ok(self
+            .gossipsub_message_rx
+            .lock()
+            .allow_threads_py() // allow-threads-aware async call
+            .await
+            .recv_many_py(limit)
+            .allow_threads_py() // allow-threads-aware async call
+            .await?
+            .into_iter()
+            .map(|(t, d)| (t, d.pybytes()))
+            .collect())
+    }
+
+    // TODO: rn this blocks main thread if anything else is awaiting the channel (bc it's a mutex)
+    //       so it's too dangerous to expose just yet. figure out better semantics for handling this,
+    //       so things don't randomly block
+    // /// Tries to receive the next message from the `GossipSub` network.
+    // fn gossipsub_try_recv(&self) -> PyResult<Option<(String, Py<PyBytes>)>> {
+    //     Ok(self
+    //         .gossipsub_message_rx
+    //         .blocking_lock()
+    //         .try_recv_py()?
+    //         .map(|(t, d)| (t, d.pybytes())))
+    // }
+    //
+    // /// Checks if the `GossipSub` message channel is empty.
+    // fn gossipsub_is_empty(&self) -> bool {
+    //     self.gossipsub_message_rx.blocking_lock().is_empty()
+    // }
+    //
+    // /// Returns the number of `GossipSub` messages in the channel.
+    // fn gossipsub_len(&self) -> usize {
+    //     self.gossipsub_message_rx.blocking_lock().len()
+    // }
 }

-// Manually submit the run()/recv() stub because the cancelhandle is poorly understood
-submit! {
-    gen_methods_from_python! {
-        r#"
-        class PyPeer:
-            async def run(self): ...
-            async def recv(self) -> PySwarmEvent: ...
-        "#
-    }
-}
+pub fn networking_submodule(m: &Bound<'_, PyModule>) -> PyResult<()> {
+    m.add_class::<exception::PyNoPeersSubscribedToTopicError>()?;
+    m.add_class::<exception::PyAllQueuesFullError>()?;

-#[gen_stub_pyclass]
-#[pyclass]
-pub struct PySwarmEvent(FromSwarm);
+    m.add_class::<PyConnectionUpdateType>()?;
+    m.add_class::<PyConnectionUpdate>()?;
+    m.add_class::<PyNetworkingHandle>()?;

-#[gen_stub_pymethods]
-#[pymethods]
-impl PySwarmEvent {
-    // probably a better way to do this, but...
-    fn downcast_discovered(&self) -> Option<String> {
-        if let FromSwarm::Discovered(peer_id) = self.0 {
-            Some(peer_id.to_base58())
-        } else {
-            None
-        }
-    }
-    fn downcast_expired(&self) -> Option<String> {
-        if let FromSwarm::Expired(peer_id) = self.0 {
-            Some(peer_id.to_base58())
-        } else {
-            None
-        }
-    }
-    fn downcast_message<'py>(
-        &self,
-        py: Python<'py>,
-    ) -> Option<(String, String, Bound<'py, PyBytes>)> {
-        if let FromSwarm::Message(peer_id, topic, data) = &self.0 {
-            Some((peer_id.to_base58(), topic.clone(), PyBytes::new(py, data)))
-        } else {
-            None
-        }
-    }
+    Ok(())
 }
--- a/rust/exo_pyo3_bindings/src/pylibp2p/ident.rs
+++ b/rust/exo_pyo3_bindings/src/pylibp2p/ident.rs
@@ -0,0 +1,159 @@
+use crate::ext::ResultExt as _;
+use libp2p::PeerId;
+use libp2p::identity::Keypair;
+use pyo3::prelude::{PyBytesMethods as _, PyModule, PyModuleMethods as _};
+use pyo3::types::PyBytes;
+use pyo3::{Bound, PyResult, Python, pyclass, pymethods};
+use pyo3_stub_gen::derive::{gen_stub_pyclass, gen_stub_pymethods};
+
+/// Identity keypair of a node.
+#[gen_stub_pyclass]
+#[pyclass(name = "Keypair", frozen)]
+#[repr(transparent)]
+pub struct PyKeypair(pub Keypair);
+
+#[gen_stub_pymethods]
+#[pymethods]
+#[allow(clippy::needless_pass_by_value)]
+impl PyKeypair {
+    /// Generate a new Ed25519 keypair.
+    #[staticmethod]
+    fn generate_ed25519() -> Self {
+        Self(Keypair::generate_ed25519())
+    }
+
+    /// Generate a new ECDSA keypair.
+    #[staticmethod]
+    fn generate_ecdsa() -> Self {
+        Self(Keypair::generate_ecdsa())
+    }
+
+    /// Generate a new Secp256k1 keypair.
+    #[staticmethod]
+    fn generate_secp256k1() -> Self {
+        Self(Keypair::generate_secp256k1())
+    }
+
+    /// Decode a private key from a protobuf structure and parse it as a `Keypair`.
+    #[staticmethod]
+    fn from_protobuf_encoding(bytes: Bound<'_, PyBytes>) -> PyResult<Self> {
+        let bytes = Vec::from(bytes.as_bytes());
+        Ok(Self(Keypair::from_protobuf_encoding(&bytes).pyerr()?))
+    }
+
+    /// Decode a keypair from a DER-encoded secret key in PKCS#8 `PrivateKeyInfo`
+    /// format (i.e. unencrypted) as defined in [RFC5208].
+    ///
+    /// [RFC5208]: https://tools.ietf.org/html/rfc5208#section-5
+    #[staticmethod]
+    fn rsa_from_pkcs8(bytes: Bound<'_, PyBytes>) -> PyResult<Self> {
+        let mut bytes = Vec::from(bytes.as_bytes());
+        Ok(Self(Keypair::rsa_from_pkcs8(&mut bytes).pyerr()?))
+    }
+
+    /// Decode a keypair from a DER-encoded Secp256k1 secret key in an `ECPrivateKey`
+    /// structure as defined in [RFC5915].
+    ///
+    /// [RFC5915]: https://tools.ietf.org/html/rfc5915
+    #[staticmethod]
+    fn secp256k1_from_der(bytes: Bound<'_, PyBytes>) -> PyResult<Self> {
+        let mut bytes = Vec::from(bytes.as_bytes());
+        Ok(Self(Keypair::secp256k1_from_der(&mut bytes).pyerr()?))
+    }
+
+    #[staticmethod]
+    fn ed25519_from_bytes(bytes: Bound<'_, PyBytes>) -> PyResult<Self> {
+        let mut bytes = Vec::from(bytes.as_bytes());
+        Ok(Self(Keypair::ed25519_from_bytes(&mut bytes).pyerr()?))
+    }
+
+    /// Encode a private key as protobuf structure.
+    fn to_protobuf_encoding<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyBytes>> {
+        let bytes = self.0.to_protobuf_encoding().pyerr()?;
+        Ok(PyBytes::new(py, &bytes))
+    }
+
+    /// Convert the `Keypair` into the corresponding `PeerId`.
+    fn to_peer_id(&self) -> PyPeerId {
+        PyPeerId(self.0.public().to_peer_id())
+    }
+
+    // /// Hidden constructor for pickling support. TODO: figure out how to do pickling...
+    // #[gen_stub(skip)]
+    // #[new]
+    // fn py_new(bytes: Bound<'_, PyBytes>) -> PyResult<Self> {
+    //     Self::from_protobuf_encoding(bytes)
+    // }
+    //
+    // #[gen_stub(skip)]
+    // fn __setstate__(&mut self, state: Bound<'_, PyBytes>) -> PyResult<()> {
+    //     *self = Self::from_protobuf_encoding(state)?;
+    //     Ok(())
+    // }
+    //
+    // #[gen_stub(skip)]
+    // fn __getstate__<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyBytes>> {
+    //     self.to_protobuf_encoding(py)
+    // }
+    //
+    // #[gen_stub(skip)]
+    // pub fn __getnewargs__<'py>(&self, py: Python<'py>) -> PyResult<(Bound<'py, PyBytes>,)> {
+    //     Ok((self.to_protobuf_encoding(py)?,))
+    // }
+}
+
+/// Identifier of a peer of the network.
+///
+/// The data is a `CIDv0` compatible multihash of the protobuf encoded public key of the peer
+/// as specified in [specs/peer-ids](https://github.com/libp2p/specs/blob/master/peer-ids/peer-ids.md).
+#[gen_stub_pyclass]
+#[pyclass(name = "PeerId", frozen)]
+#[derive(Debug, Clone)]
+#[repr(transparent)]
+pub struct PyPeerId(pub PeerId);
+
+#[gen_stub_pymethods]
+#[pymethods]
+#[allow(clippy::needless_pass_by_value)]
+impl PyPeerId {
+    /// Generates a random peer ID from a cryptographically secure PRNG.
+    ///
+    /// This is useful for randomly walking on a DHT, or for testing purposes.
+    #[staticmethod]
+    fn random() -> Self {
+        Self(PeerId::random())
+    }
+
+    /// Parses a `PeerId` from bytes.
+    #[staticmethod]
+    fn from_bytes(bytes: Bound<'_, PyBytes>) -> PyResult<Self> {
+        let bytes = Vec::from(bytes.as_bytes());
+        Ok(Self(PeerId::from_bytes(&bytes).pyerr()?))
+    }
+
+    /// Returns a raw bytes representation of this `PeerId`.
+    fn to_bytes<'py>(&self, py: Python<'py>) -> Bound<'py, PyBytes> {
+        let bytes = self.0.to_bytes();
+        PyBytes::new(py, &bytes)
+    }
+
+    /// Returns a base-58 encoded string of this `PeerId`.
+    fn to_base58(&self) -> String {
+        self.0.to_base58()
+    }
+
+    fn __repr__(&self) -> String {
+        format!("PeerId({})", self.to_base58())
+    }
+
+    fn __str__(&self) -> String {
+        self.to_base58()
+    }
+}
+
+pub fn ident_submodule(m: &Bound<'_, PyModule>) -> PyResult<()> {
+    m.add_class::<PyKeypair>()?;
+    m.add_class::<PyPeerId>()?;
+
+    Ok(())
+}
--- a/rust/exo_pyo3_bindings/src/pylibp2p/mod.rs
+++ b/rust/exo_pyo3_bindings/src/pylibp2p/mod.rs
@@ -0,0 +1,8 @@
+//! A module for exposing Rust's libp2p datatypes over PyO3
+//!
+//! TODO: right now we are coupled to libp2p's identity, but eventually we want to create our own
+//!       independent identity type of some kind or another. This may require handshaking.
+//!
+
+pub mod ident;
+pub mod multiaddr;
--- a/rust/exo_pyo3_bindings/src/pylibp2p/multiaddr.rs
+++ b/rust/exo_pyo3_bindings/src/pylibp2p/multiaddr.rs
@@ -0,0 +1,81 @@
+use crate::ext::ResultExt as _;
+use libp2p::Multiaddr;
+use pyo3::prelude::{PyBytesMethods as _, PyModule, PyModuleMethods as _};
+use pyo3::types::PyBytes;
+use pyo3::{Bound, PyResult, Python, pyclass, pymethods};
+use pyo3_stub_gen::derive::{gen_stub_pyclass, gen_stub_pymethods};
+use std::str::FromStr as _;
+
+/// Representation of a Multiaddr.
+#[gen_stub_pyclass]
+#[pyclass(name = "Multiaddr", frozen)]
+#[derive(Debug, Clone)]
+#[repr(transparent)]
+pub struct PyMultiaddr(pub Multiaddr);
+
+#[gen_stub_pymethods]
+#[pymethods]
+#[allow(clippy::needless_pass_by_value)]
+impl PyMultiaddr {
+    /// Create a new, empty multiaddress.
+    #[staticmethod]
+    fn empty() -> Self {
+        Self(Multiaddr::empty())
+    }
+
+    /// Create a new, empty multiaddress with the given capacity.
+    #[staticmethod]
+    fn with_capacity(n: usize) -> Self {
+        Self(Multiaddr::with_capacity(n))
+    }
+
+    /// Parse a `Multiaddr` value from its byte slice representation.
+    #[staticmethod]
+    fn from_bytes(bytes: Bound<'_, PyBytes>) -> PyResult<Self> {
+        let bytes = Vec::from(bytes.as_bytes());
+        Ok(Self(Multiaddr::try_from(bytes).pyerr()?))
+    }
+
+    /// Parse a `Multiaddr` value from its string representation.
+    #[staticmethod]
+    fn from_string(string: String) -> PyResult<Self> {
+        Ok(Self(Multiaddr::from_str(&string).pyerr()?))
+    }
+
+    /// Return the length in bytes of this multiaddress.
+    fn len(&self) -> usize {
+        self.0.len()
+    }
+
+    /// Returns true if the length of this multiaddress is 0.
+    fn is_empty(&self) -> bool {
+        self.0.is_empty()
+    }
+
+    /// Return a copy of this [`Multiaddr`]'s byte representation.
+    fn to_bytes<'py>(&self, py: Python<'py>) -> Bound<'py, PyBytes> {
+        let bytes = self.0.to_vec();
+        PyBytes::new(py, &bytes)
+    }
+
+    /// Convert a Multiaddr to a string.
+    fn to_string(&self) -> String {
+        self.0.to_string()
+    }
+
+    #[gen_stub(skip)]
+    fn __repr__(&self) -> String {
+        format!("Multiaddr({})", self.0)
+    }
+
+    #[gen_stub(skip)]
+    fn __str__(&self) -> String {
+        self.to_string()
+    }
+}
+
+pub fn multiaddr_submodule(m: &Bound<'_, PyModule>) -> PyResult<()> {
+    m.add_class::<PyMultiaddr>()?;
+
+    Ok(())
+}
--- a/rust/exo_pyo3_bindings/tests/dummy.rs
+++ b/rust/exo_pyo3_bindings/tests/dummy.rs
@@ -0,0 +1,54 @@
+#[cfg(test)]
+mod tests {
+    use core::mem::drop;
+    use core::option::Option::Some;
+    use core::time::Duration;
+    use tokio;
+    use tokio::sync::mpsc;
+
+    #[tokio::test]
+    async fn test_drop_channel() {
+        struct Ping;
+
+        let (tx, mut rx) = mpsc::channel::<Ping>(10);
+
+        let _ = tokio::spawn(async move {
+            println!("TASK: entered");
+
+            loop {
+                tokio::select! {
+                    result = rx.recv() => {
+                        match result {
+                            Some(_) => {
+                                println!("TASK: pinged");
+                            }
+                            None => {
+                                println!("TASK: closing channel");
+                                break;
+                            }
+                        }
+                    }
+                    _ = tokio::time::sleep(Duration::from_secs_f32(0.1)) => {
+                        println!("TASK: heartbeat");
+                    }
+                }
+            }
+
+            println!("TASK: exited");
+        });
+
+        let tx2 = tx.clone();
+
+        tokio::time::sleep(Duration::from_secs_f32(0.11)).await;
+
+        tx.send(Ping).await.expect("Should not fail");
+        drop(tx);
+
+        tokio::time::sleep(Duration::from_secs_f32(0.11)).await;
+
+        tx2.send(Ping).await.expect("Should not fail");
+        drop(tx2);
+
+        tokio::time::sleep(Duration::from_secs_f32(0.11)).await;
+    }
+}
--- a/rust/networking/Cargo.toml
+++ b/rust/networking/Cargo.toml
@@ -13,14 +13,32 @@ path = "src/lib.rs"
 workspace = true

 [dependencies]
+# datastructures
+either = { workspace = true }
+
+# macro dependencies
+extend = { workspace = true }
+delegate = { workspace = true }
+impl-trait-for-tuples = { workspace = true }
+derive_more = { workspace = true }
+
 # async
 tokio = { workspace = true, features = ["full"] }
+futures = { workspace = true }
+futures-timer = { workspace = true }

 # utility dependencies
+util = { workspace = true }
+thiserror = { workspace = true }
+#internment = { workspace = true }
+#recursion = { workspace = true }
+#generativity = { workspace = true }
+#itertools = { workspace = true }
 tracing-subscriber = { version = "0.3.19", features = ["default", "env-filter"] }
+keccak-const = { workspace = true }

 # tracing/logging
 log = { workspace = true }

 # networking
-libp2p = { workspace = true, features = ["full"] }
+libp2p = { workspace = true, features = ["full"] }
--- a/rust/networking/examples/chatroom.rs
+++ b/rust/networking/examples/chatroom.rs
@@ -1,6 +1,6 @@
-use libp2p::identity;
-use networking::{self, FromSwarm, ToSwarm};
-use tokio::sync::mpsc;
+use futures::stream::StreamExt as _;
+use libp2p::{gossipsub, identity, swarm::SwarmEvent};
+use networking::{discovery, swarm};
 use tokio::{io, io::AsyncBufReadExt as _, select};
 use tracing_subscriber::EnvFilter;
 use tracing_subscriber::filter::LevelFilter;
@@ -12,51 +12,63 @@ async fn main() {
        .try_init();

    // Configure swarm
-    let (to_client, mut from_swarm) = mpsc::channel(20);
-    let (to_swarm, from_client) = mpsc::channel(20);
-    let mut peer = networking::Peer::new(
-        "chatroom!".to_string(),
-        identity::Keypair::generate_ed25519(),
-        to_client,
-        from_client,
-    )
-    .expect("listen error");
+    let mut swarm =
+        swarm::create_swarm(identity::Keypair::generate_ed25519()).expect("Swarm creation failed");

    // Create a Gossipsub topic & subscribe
+    let topic = gossipsub::IdentTopic::new("test-net");
+    swarm
+        .behaviour_mut()
+        .gossipsub
+        .subscribe(&topic)
+        .expect("Subscribing to topic failed");
+
    // Read full lines from stdin
    let mut stdin = io::BufReader::new(io::stdin()).lines();
    println!("Enter messages via STDIN and they will be sent to connected peers using Gossipsub");

-    let jh = tokio::spawn(async move { peer.run().await });
-    _ = to_swarm
-        .send(ToSwarm::Subscribe("chatting".to_string()))
-        .await;
-
    // Kick it off
    loop {
        select! {
            // on gossipsub outgoing
            Ok(Some(line)) = stdin.next_line() => {
-                _ = to_swarm.send(ToSwarm::Message("chatting".to_string(), line.into_bytes())).await;
+                if let Err(e) = swarm
+                    .behaviour_mut().gossipsub
+                    .publish(topic.clone(), line.as_bytes()) {
+                    println!("Publish error: {e:?}");
+                }
            }
-            event = from_swarm.recv() => match event {
+            event = swarm.select_next_some() => match event {
                // on gossipsub incoming
-                Some(FromSwarm::Message(peer_id,_, data)) => println!(
-                        "\n\nGot message: '{}' from peer: {peer_id}\n\n",
-                        String::from_utf8_lossy(&data),
+                SwarmEvent::Behaviour(swarm::BehaviourEvent::Gossipsub(gossipsub::Event::Message {
+                    propagation_source: peer_id,
+                    message_id: id,
+                    message,
+                })) => println!(
+                        "\n\nGot message: '{}' with id: {id} from peer: {peer_id}\n\n",
+                        String::from_utf8_lossy(&message.data),
                    ),

                // on discovery
-                Some(FromSwarm::Discovered(peer_id)) => {
-                    println!("\n\nConnected to: {peer_id}\n\n");
+                SwarmEvent::Behaviour(swarm::BehaviourEvent::Discovery(e)) => match e {
+                    discovery::Event::ConnectionEstablished {
+                        peer_id, connection_id, remote_ip, remote_tcp_port
+                    } => {
+                        println!("\n\nConnected to: {peer_id}; connection ID: {connection_id}; remote IP: {remote_ip}; remote TCP port: {remote_tcp_port}\n\n");
+                    }
+                    discovery::Event::ConnectionClosed {
+                        peer_id, connection_id, remote_ip, remote_tcp_port
+                    } => {
+                        eprintln!("\n\nDisconnected from: {peer_id}; connection ID: {connection_id}; remote IP: {remote_ip}; remote TCP port: {remote_tcp_port}\n\n");
+                    }
                }
-                Some(FromSwarm::Expired(peer_id)) => {
-                    println!("\n\nDisconnected from: {peer_id}\n\n");
-                }
-                Some(FromSwarm::PublishError(e)) => eprintln!("\n\nError {e:?}\n\n"),
-                None => break,
+
+                // ignore outgoing errors: those are normal
+                e@SwarmEvent::OutgoingConnectionError { .. } => { log::debug!("Outgoing connection error: {e:?}"); }
+
+                // otherwise log any other event
+                e => { log::info!("Other event {e:?}"); }
            }
        }
    }
-    _ = jh.await;
 }
--- a/rust/networking/examples/chatroom_manual.rs
+++ b/rust/networking/examples/chatroom_manual.rs
@@ -0,0 +1,127 @@
+// Copyright 2018 Parity Technologies (UK) Ltd.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a
+// copy of this software and associated documentation files (the "Software"),
+// to deal in the Software without restriction, including without limitation
+// the rights to use, copy, modify, merge, publish, distribute, sublicense,
+// and/or sell copies of the Software, and to permit persons to whom the
+// Software is furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in
+// all copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+// OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+// DEALINGS IN THE SOFTWARE.
+
+use futures::stream::StreamExt;
+use libp2p::{
+    gossipsub, mdns, noise,
+    swarm::{NetworkBehaviour, SwarmEvent},
+    tcp, yamux,
+};
+use std::time::Duration;
+use std::{error::Error, hash::Hash};
+use tokio::{io, io::AsyncBufReadExt, select};
+use tracing_subscriber::EnvFilter;
+
+// We create a custom network behaviour that combines Gossipsub and Mdns.
+#[derive(NetworkBehaviour)]
+struct MyBehaviour {
+    gossipsub: gossipsub::Behaviour,
+    mdns: mdns::tokio::Behaviour,
+}
+
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn Error>> {
+    let _ = tracing_subscriber::fmt()
+        .with_env_filter(EnvFilter::from_default_env())
+        .try_init();
+
+    let mut swarm = libp2p::SwarmBuilder::with_new_identity()
+        .with_tokio()
+        .with_tcp(
+            tcp::Config::default(),
+            noise::Config::new,
+            yamux::Config::default,
+        )?
+        .with_behaviour(|key| {
+            // Set a custom gossipsub configuration
+            let gossipsub_config = gossipsub::ConfigBuilder::default()
+                .heartbeat_interval(Duration::from_secs(10))
+                .validation_mode(gossipsub::ValidationMode::Strict) // This sets the kind of message validation. The default is Strict (enforce message signing)
+                .build()
+                .map_err(io::Error::other)?; // Temporary hack because `build` does not return a proper `std::error::Error`.
+
+            // build a gossipsub network behaviour
+            let gossipsub = gossipsub::Behaviour::new(
+                gossipsub::MessageAuthenticity::Signed(key.clone()),
+                gossipsub_config,
+            )?;
+
+            let mdns =
+                mdns::tokio::Behaviour::new(mdns::Config::default(), key.public().to_peer_id())?;
+            Ok(MyBehaviour { gossipsub, mdns })
+        })?
+        .build();
+
+    println!("Running swarm with identity {}", swarm.local_peer_id());
+
+    // Create a Gossipsub topic
+    let topic = gossipsub::IdentTopic::new("test-net");
+    // subscribes to our topic
+    swarm.behaviour_mut().gossipsub.subscribe(&topic)?;
+
+    // Read full lines from stdin
+    let mut stdin = io::BufReader::new(io::stdin()).lines();
+
+    // Listen on all interfaces and whatever port the OS assigns
+    swarm.listen_on("/ip4/0.0.0.0/tcp/0".parse()?)?;
+
+    println!("Enter messages via STDIN and they will be sent to connected peers using Gossipsub");
+
+    // Kick it off
+    loop {
+        select! {
+            Ok(Some(line)) = stdin.next_line() => {
+                if let Err(e) = swarm
+                    .behaviour_mut().gossipsub
+                    .publish(topic.clone(), line.as_bytes()) {
+                    println!("Publish error: {e:?}");
+                }
+            }
+            event = swarm.select_next_some() => match event {
+                SwarmEvent::Behaviour(MyBehaviourEvent::Mdns(mdns::Event::Discovered(list))) => {
+                    for (peer_id, multiaddr) in list {
+                        println!("mDNS discovered a new peer: {peer_id} on {multiaddr}");
+                        swarm.behaviour_mut().gossipsub.add_explicit_peer(&peer_id);
+                    }
+                },
+                SwarmEvent::Behaviour(MyBehaviourEvent::Mdns(mdns::Event::Expired(list))) => {
+                    for (peer_id, multiaddr) in list {
+                        println!("mDNS discovered peer has expired: {peer_id} on {multiaddr}");
+                        swarm.behaviour_mut().gossipsub.remove_explicit_peer(&peer_id);
+                    }
+                },
+                SwarmEvent::Behaviour(MyBehaviourEvent::Gossipsub(gossipsub::Event::Message {
+                    propagation_source: peer_id,
+                    message_id: id,
+                    message,
+                })) => println!(
+                        "Got message: '{}' with id: {id} from peer: {peer_id}",
+                        String::from_utf8_lossy(&message.data),
+                    ),
+                SwarmEvent::NewListenAddr { address, .. } => {
+                    println!("Local node is listening on {address}");
+                }
+                e => {
+                    println!("Other swarm event: {:?}", e);
+                }
+            }
+        }
+    }
+}
--- a/rust/networking/src/RESEARCH_NOTES.txt
+++ b/rust/networking/src/RESEARCH_NOTES.txt
@@ -0,0 +1,44 @@
+https://github.com/ml-explore/mlx/commit/3fe98bacc7640d857acf3539f1d21b47a32e5609
+^raw sockets distributed -> `<net/ndrv.h>` -> https://newosxbook.com/code/xnu-3247.1.106/bsd/net/ndrv.h.auto.html
+--> header file for a networking component found in the macOS kernel (XNU) that defines structures for network device driver registration, specifically the ndrv_demux_desc and ndrv_protocol_desc structures used for demultiplexing protocol data at the network interface level. It specifies how to describe protocol data, such as an Ethernet type or a SNAP header, and how to associate these descriptions with a specific protocol family to receive matching packets.
+--> Used to bind an NDRV socket so that packets that match given protocol demux descriptions can be received.
+--> An NDRV socket is a special kind of socket in the Darwin/macOS operating system's XNU kernel, used for low-level network packet manipulation and binding to specific protocols for packet processing. It allows user-space applications or drivers to directly write Layer 2 (L2) network packets or interact with the network stack at a lower level, often by binding to protocol descriptors like the ndrv_protocol_desc. This type of socket is used for functions such as capturing and injecting packets, especially in network infrastructure software like routers or for kernel-level network monitoring and security tools.
+--> also called PF_NDRV sockets --> https://newosxbook.com/bonus/vol1ch16.html
+----> they are conceptually similar to https://scapy.disruptivelabs.in/networking/socket-interface PF_RAW or PF_PACKET
+
+https://stackoverflow.com/questions/17169298/af-packet-on-osx
+^AF_PACKET duplicates the packets as soon as it receives them from the physical layer (for incoming packets) or just before sending them out to the physical layer (for outgoing packets). -> this is on Linux only
+^it doesn't exist on OS X so you can use /dev/bpfX (Berkeley Packet Filter) for sniffing
+
+https://www.unix.com/man_page/mojave/4/ip/
+^OS X manpages for IP
+
+https://developer.apple.com/documentation/kernel/implementing_drivers_system_extensions_and_kexts
+^DriverKit, system extensions & kexts for macOS
+
+----
+
+To set up a Linux system to use a Thunderbolt connection as a network device, connect the two computers with a Thunderbolt cable, load the thunderbolt-net kernel module (usually automatic but modprobe is an option for manual loading), and then the operating system will create virtual Ethernet interfaces (e.g., thunderbolt0) for networking. You can then use standard tools like ifconfig or your desktop environment's network manager to configure these new interfaces for a link-local network.
+--> https://gist.github.com/geosp/80fbd39e617b7d1d9421683df4ea224a
+----> here is a guide on how to set up thunderbolt-ethernet on linux
+----> I may be able to borrow ideas from the thunderbolt-net code to implement a kernel module for macOS
+
+https://chatgpt.com/s/t_68af8e41a8548191993281a014f846a7
+^GPT discussion about making socket interface
+
+https://chatgpt.com/s/t_68afb798a85c8191973c02a0fa7a48a3 --> link-local address,,??
+https://chatgpt.com/s/t_68afb02987e08191b2b0044d3667ece2
+^GPT discussion about accessing TB on MacOS low level interactions
+
+--------------------------------
+
+https://www.intel.com/content/www/us/en/support/articles/000098893/software.html
+^Thunderbolt Share & Thunderbolt Networking Mode => intel's equivalent of thunderbolt bridge
+
+
+---------------------------------
+
+https://www.zerotier.com/blog/how-zerotier-eliminated-kernel-extensions-on-macos/
+-->fake ethernet devices on MacOS -> omg??? we can detect thunderbolt bridge, then bind to it, then re-expose it as fake ethernet??
+-->ps: https://chatgpt.com/s/t_68afb2b25fb881919526763fb5d7359c, AF/PF_NDRV are one and the same!!!
+-->https://github.com/zerotier/ZeroTierOne/blob/dev/osdep/MacEthernetTapAgent.c
--- a/rust/networking/src/discovery.rs
+++ b/rust/networking/src/discovery.rs
@@ -0,0 +1,383 @@
+use crate::ext::MultiaddrExt;
+use crate::keep_alive;
+use delegate::delegate;
+use either::Either;
+use futures::FutureExt;
+use futures_timer::Delay;
+use libp2p::core::transport::PortUse;
+use libp2p::core::{ConnectedPoint, Endpoint};
+use libp2p::swarm::behaviour::ConnectionEstablished;
+use libp2p::swarm::dial_opts::DialOpts;
+use libp2p::swarm::{
+    CloseConnection, ConnectionClosed, ConnectionDenied, ConnectionHandler,
+    ConnectionHandlerSelect, ConnectionId, FromSwarm, NetworkBehaviour, THandler, THandlerInEvent,
+    THandlerOutEvent, ToSwarm, dummy,
+};
+use libp2p::{Multiaddr, PeerId, identity, mdns};
+use std::collections::{BTreeSet, HashMap};
+use std::convert::Infallible;
+use std::io;
+use std::net::IpAddr;
+use std::task::{Context, Poll};
+use std::time::Duration;
+use util::wakerdeque::WakerDeque;
+
+const RETRY_CONNECT_INTERVAL: Duration = Duration::from_secs(5);
+
+mod managed {
+    use libp2p::swarm::NetworkBehaviour;
+    use libp2p::{identity, mdns, ping};
+    use std::io;
+    use std::time::Duration;
+
+    const MDNS_RECORD_TTL: Duration = Duration::from_secs(2_500);
+    const MDNS_QUERY_INTERVAL: Duration = Duration::from_secs(1_500);
+    const PING_TIMEOUT: Duration = Duration::from_millis(2_500);
+    const PING_INTERVAL: Duration = Duration::from_millis(2_500);
+
+    #[derive(NetworkBehaviour)]
+    pub struct Behaviour {
+        mdns: mdns::tokio::Behaviour,
+        ping: ping::Behaviour,
+    }
+
+    impl Behaviour {
+        pub fn new(keypair: &identity::Keypair) -> io::Result<Self> {
+            Ok(Self {
+                mdns: mdns_behaviour(keypair)?,
+                ping: ping_behaviour(),
+            })
+        }
+    }
+
+    fn mdns_behaviour(keypair: &identity::Keypair) -> io::Result<mdns::tokio::Behaviour> {
+        use mdns::{Config, tokio};
+
+        // mDNS config => IPv6 stays disabled for now (see TODO below)
+        let mdns_config = Config {
+            ttl: MDNS_RECORD_TTL,
+            query_interval: MDNS_QUERY_INTERVAL,
+
+            // enable_ipv6: true, // TODO: for some reason, TCP+mDNS don't work well with ipv6?? figure out how to make work
+            ..Default::default()
+        };
+
+        let mdns_behaviour = tokio::Behaviour::new(mdns_config, keypair.public().to_peer_id());
+        Ok(mdns_behaviour?)
+    }
+
+    fn ping_behaviour() -> ping::Behaviour {
+        ping::Behaviour::new(
+            ping::Config::new()
+                .with_timeout(PING_TIMEOUT)
+                .with_interval(PING_INTERVAL),
+        )
+    }
+}
+
+/// Events for when a listening connection is truly established and truly closed.
+#[derive(Debug, Clone)]
+pub enum Event {
+    ConnectionEstablished {
+        peer_id: PeerId,
+        connection_id: ConnectionId,
+        remote_ip: IpAddr,
+        remote_tcp_port: u16,
+    },
+    ConnectionClosed {
+        peer_id: PeerId,
+        connection_id: ConnectionId,
+        remote_ip: IpAddr,
+        remote_tcp_port: u16,
+    },
+}
+
+/// Discovery behavior that wraps mDNS to produce truly discovered durable peer-connections.
+///
+/// The behaviour operates as such:
+///  1) All true (listening) connections/disconnections are tracked, emitting corresponding events
+///     to the swarm.
+///  2) mDNS discovered/expired peers are tracked; discovered but not connected peers are dialed
+///     immediately, and expired but connected peers are disconnected from immediately.
+///  3) Every fixed interval: discovered but not connected peers are dialed, and expired but
+///     connected peers are disconnected from.
+pub struct Behaviour {
+    // state-tracking for managed behaviors & mDNS-discovered peers
+    managed: managed::Behaviour,
+    mdns_discovered: HashMap<PeerId, BTreeSet<Multiaddr>>,
+
+    retry_delay: Delay, // retry interval
+
+    // pending events to emit => waker-backed Deque to control polling
+    pending_events: WakerDeque<ToSwarm<Event, Infallible>>,
+}
+
+impl Behaviour {
+    pub fn new(keypair: &identity::Keypair) -> io::Result<Self> {
+        Ok(Self {
+            managed: managed::Behaviour::new(keypair)?,
+            mdns_discovered: HashMap::new(),
+            retry_delay: Delay::new(RETRY_CONNECT_INTERVAL),
+            pending_events: WakerDeque::new(),
+        })
+    }
+
+    fn dial(&mut self, peer_id: PeerId, addr: Multiaddr) {
+        self.pending_events.push_back(ToSwarm::Dial {
+            opts: DialOpts::peer_id(peer_id).addresses(vec![addr]).build(),
+        })
+    }
+
+    fn close_connection(&mut self, peer_id: PeerId, connection: ConnectionId) {
+        // push front to make this IMMEDIATE
+        self.pending_events.push_front(ToSwarm::CloseConnection {
+            peer_id,
+            connection: CloseConnection::One(connection),
+        })
+    }
+
+    fn handle_mdns_discovered(&mut self, peers: Vec<(PeerId, Multiaddr)>) {
+        for (p, ma) in peers {
+            self.dial(p, ma.clone()); // always connect
+
+            // get peer's multi-addresses or insert if missing
+            let Some(mas) = self.mdns_discovered.get_mut(&p) else {
+                self.mdns_discovered.insert(p, BTreeSet::from([ma]));
+                continue;
+            };
+
+            // multiaddress should never already be present - else something has gone wrong
+            let is_new_addr = mas.insert(ma);
+            assert!(is_new_addr, "cannot discover a discovered peer");
+        }
+    }
+
+    fn handle_mdns_expired(&mut self, peers: Vec<(PeerId, Multiaddr)>) {
+        for (p, ma) in peers {
+            // at this point, we *must* have the peer
+            let mas = self
+                .mdns_discovered
+                .get_mut(&p)
+                .expect("nonexistent peer cannot expire");
+
+            // at this point, we *must* have the multiaddress
+            let was_present = mas.remove(&ma);
+            assert!(was_present, "nonexistent multiaddress cannot expire");
+
+            // if empty, remove the peer-id entirely
+            if mas.is_empty() {
+                self.mdns_discovered.remove(&p);
+            }
+        }
+    }
+
+    fn on_connection_established(
+        &mut self,
+        peer_id: PeerId,
+        connection_id: ConnectionId,
+        remote_ip: IpAddr,
+        remote_tcp_port: u16,
+    ) {
+        // send out connected event
+        self.pending_events
+            .push_back(ToSwarm::GenerateEvent(Event::ConnectionEstablished {
+                peer_id,
+                connection_id,
+                remote_ip,
+                remote_tcp_port,
+            }));
+    }
+
+    fn on_connection_closed(
+        &mut self,
+        peer_id: PeerId,
+        connection_id: ConnectionId,
+        remote_ip: IpAddr,
+        remote_tcp_port: u16,
+    ) {
+        // send out disconnected event
+        self.pending_events
+            .push_back(ToSwarm::GenerateEvent(Event::ConnectionClosed {
+                peer_id,
+                connection_id,
+                remote_ip,
+                remote_tcp_port,
+            }));
+    }
+}
+
+impl NetworkBehaviour for Behaviour {
+    type ConnectionHandler =
+        ConnectionHandlerSelect<dummy::ConnectionHandler, THandler<managed::Behaviour>>;
+    type ToSwarm = Event;
+
+    // simply delegate to the underlying managed (mDNS + ping) behaviour
+
+    delegate! {
+        to self.managed {
+            fn handle_pending_inbound_connection(&mut self, connection_id: ConnectionId, local_addr: &Multiaddr, remote_addr: &Multiaddr) -> Result<(), ConnectionDenied>;
+            fn handle_pending_outbound_connection(&mut self, connection_id: ConnectionId, maybe_peer: Option<PeerId>, addresses: &[Multiaddr], effective_role: Endpoint) -> Result<Vec<Multiaddr>, ConnectionDenied>;
+        }
+    }
+
+    fn handle_established_inbound_connection(
+        &mut self,
+        connection_id: ConnectionId,
+        peer: PeerId,
+        local_addr: &Multiaddr,
+        remote_addr: &Multiaddr,
+    ) -> Result<THandler<Self>, ConnectionDenied> {
+        Ok(ConnectionHandler::select(
+            dummy::ConnectionHandler,
+            self.managed.handle_established_inbound_connection(
+                connection_id,
+                peer,
+                local_addr,
+                remote_addr,
+            )?,
+        ))
+    }
+
+    #[allow(clippy::needless_question_mark)]
+    fn handle_established_outbound_connection(
+        &mut self,
+        connection_id: ConnectionId,
+        peer: PeerId,
+        addr: &Multiaddr,
+        role_override: Endpoint,
+        port_use: PortUse,
+    ) -> Result<THandler<Self>, ConnectionDenied> {
+        Ok(ConnectionHandler::select(
+            dummy::ConnectionHandler,
+            self.managed.handle_established_outbound_connection(
+                connection_id,
+                peer,
+                addr,
+                role_override,
+                port_use,
+            )?,
+        ))
+    }
+
+    fn on_connection_handler_event(
+        &mut self,
+        peer_id: PeerId,
+        connection_id: ConnectionId,
+        event: THandlerOutEvent<Self>,
+    ) {
+        match event {
+            Either::Left(ev) => libp2p::core::util::unreachable(ev),
+            Either::Right(ev) => {
+                self.managed
+                    .on_connection_handler_event(peer_id, connection_id, ev)
+            }
+        }
+    }
+
+    // hook into these methods to drive behavior
+
+    fn on_swarm_event(&mut self, event: FromSwarm) {
+        self.managed.on_swarm_event(event); // let the managed behaviours (mDNS + ping) handle swarm events
+
+        // handle swarm events to update internal state:
+        match event {
+            FromSwarm::ConnectionEstablished(ConnectionEstablished {
+                peer_id,
+                connection_id,
+                endpoint,
+                ..
+            }) => {
+                let remote_address = match endpoint {
+                    ConnectedPoint::Dialer { address, .. } => address,
+                    ConnectedPoint::Listener { send_back_addr, .. } => send_back_addr,
+                };
+
+                if let Some((ip, port)) = remote_address.try_to_tcp_addr() {
+                    // handle connection established event which is filtered correctly
+                    self.on_connection_established(peer_id, connection_id, ip, port)
+                }
+            }
+            FromSwarm::ConnectionClosed(ConnectionClosed {
+                peer_id,
+                connection_id,
+                endpoint,
+                ..
+            }) => {
+                let remote_address = match endpoint {
+                    ConnectedPoint::Dialer { address, .. } => address,
+                    ConnectedPoint::Listener { send_back_addr, .. } => send_back_addr,
+                };
+
+                if let Some((ip, port)) = remote_address.try_to_tcp_addr() {
+                    // handle connection closed event which is filtered correctly
+                    self.on_connection_closed(peer_id, connection_id, ip, port)
+                }
+            }
+
+            // since we are running TCP/IP transport layer, we are assuming that
+            // no address changes can occur, hence encountering one is a fatal error
+            FromSwarm::AddressChange(a) => {
+                unreachable!("unhandlable: address change encountered: {:?}", a)
+            }
+            _ => {}
+        }
+    }
+
+    fn poll(&mut self, cx: &mut Context) -> Poll<ToSwarm<Self::ToSwarm, THandlerInEvent<Self>>> {
+        // delegate to managed behaviors for any behaviors they need to perform
+        match self.managed.poll(cx) {
+            Poll::Ready(ToSwarm::GenerateEvent(e)) => {
+                match e {
+                    // handle discovered and expired events from mDNS
+                    managed::BehaviourEvent::Mdns(e) => match e {
+                        mdns::Event::Discovered(peers) => {
+                            self.handle_mdns_discovered(peers);
+                        }
+                        mdns::Event::Expired(peers) => {
+                            self.handle_mdns_expired(peers);
+                        }
+                    },
+
+                    // handle ping events => if error then disconnect
+                    managed::BehaviourEvent::Ping(e) => {
+                        if e.result.is_err() {
+                            self.close_connection(e.peer, e.connection)
+                        }
+                    }
+                }
+
+                // since we just consumed an event, we should immediately wake just in case
+                // there are more events to come where that came from
+                cx.waker().wake_by_ref();
+            }
+
+            // forward any other managed-behaviour event to the swarm or its connection handler(s)
+            Poll::Ready(e) => {
+                return Poll::Ready(
+                    e.map_out(|_| unreachable!("events returning to swarm already handled"))
+                        .map_in(Either::Right),
+                );
+            }
+
+            Poll::Pending => {}
+        }
+
+        // retry connecting to all mDNS peers periodically (fails safely if already connected)
+        if self.retry_delay.poll_unpin(cx).is_ready() {
+            for (p, mas) in self.mdns_discovered.clone() {
+                for ma in mas {
+                    self.dial(p, ma)
+                }
+            }
+            self.retry_delay.reset(RETRY_CONNECT_INTERVAL) // reset timeout
+        }
+
+        // send out any pending events from our own service
+        if let Some(e) = self.pending_events.pop_front(cx) {
+            return Poll::Ready(e.map_in(Either::Left));
+        }
+
+        // wait for pending events
+        Poll::Pending
+    }
+}
--- a/rust/networking/src/keep_alive.rs
+++ b/rust/networking/src/keep_alive.rs
@@ -0,0 +1,44 @@
+use delegate::delegate;
+use libp2p::swarm::handler::ConnectionEvent;
+use libp2p::swarm::{ConnectionHandlerEvent, SubstreamProtocol, dummy, handler};
+use std::task::{Context, Poll};
+
+/// A [`handler::ConnectionHandler`] implementation that handles no protocols but keeps
+/// the connection alive.
+#[derive(Clone)]
+#[repr(transparent)]
+pub struct ConnectionHandler(dummy::ConnectionHandler);
+
+impl ConnectionHandler {
+    pub fn new() -> Self {
+        ConnectionHandler(dummy::ConnectionHandler)
+    }
+}
+
+impl handler::ConnectionHandler for ConnectionHandler {
+    // delegate types and implementation mostly to dummy handler
+    type FromBehaviour = <dummy::ConnectionHandler as handler::ConnectionHandler>::FromBehaviour;
+    type ToBehaviour = <dummy::ConnectionHandler as handler::ConnectionHandler>::ToBehaviour;
+    type InboundProtocol =
+        <dummy::ConnectionHandler as handler::ConnectionHandler>::InboundProtocol;
+    type OutboundProtocol =
+        <dummy::ConnectionHandler as handler::ConnectionHandler>::OutboundProtocol;
+    type InboundOpenInfo =
+        <dummy::ConnectionHandler as handler::ConnectionHandler>::InboundOpenInfo;
+    type OutboundOpenInfo =
+        <dummy::ConnectionHandler as handler::ConnectionHandler>::OutboundOpenInfo;
+
+    delegate! {
+        to self.0 {
+            fn listen_protocol(&self) -> SubstreamProtocol<Self::InboundProtocol, Self::InboundOpenInfo>;
+            fn poll(&mut self, cx: &mut Context<'_>) -> Poll<ConnectionHandlerEvent<Self::OutboundProtocol, Self::OutboundOpenInfo, Self::ToBehaviour>>;
+            fn on_behaviour_event(&mut self, event: Self::FromBehaviour);
+            fn on_connection_event(&mut self, event: ConnectionEvent<Self::InboundProtocol, Self::OutboundProtocol, Self::InboundOpenInfo, Self::OutboundOpenInfo>);
+        }
+    }
+
+    // specifically override this to force connection to stay alive
+    fn connection_keep_alive(&self) -> bool {
+        true
+    }
+}
--- a/rust/networking/src/lib.rs
+++ b/rust/networking/src/lib.rs
@@ -1,299 +1,64 @@
-use libp2p::{
-    Multiaddr, PeerId,
-    futures::StreamExt,
-    gossipsub::{self, TopicHash},
-    identify,
-    identity::Keypair,
-    mdns,
-    swarm::{NetworkBehaviour, SwarmEvent, dial_opts::DialOpts},
-};
-use std::collections::HashMap;
-use tokio::sync::mpsc;
+//! TODO: crate documentation
+//!
+//! this is here as a placeholder documentation
+//!
+//!

-#[derive(Debug)]
-pub struct ListenError;
+// enable Rust-unstable features for convenience
+#![feature(trait_alias)]
+// #![feature(stmt_expr_attributes)]
+// #![feature(unboxed_closures)]
+// #![feature(assert_matches)]
+// #![feature(async_fn_in_dyn_trait)]
+// #![feature(async_for_loop)]
+// #![feature(auto_traits)]
+// #![feature(negative_impls)]

-pub enum FromSwarm {
-    PublishError(gossipsub::PublishError),
-    Discovered(PeerId),
-    Expired(PeerId),
-    Message(PeerId, String, Vec<u8>),
-}
-pub enum ToSwarm {
-    Message(String, Vec<u8>),
-    Subscribe(String),
-    Unsubscribe(String),
+pub mod discovery;
+pub mod keep_alive;
+pub mod swarm;
+
+/// Namespace for all the type/trait aliases used by this crate.
+pub(crate) mod alias {
+    use std::error::Error;
+
+    pub type AnyError = Box<dyn Error + Send + Sync + 'static>;
+    pub type AnyResult<T> = Result<T, AnyError>;
 }

-pub struct Peer {
-    pub swarm: libp2p::Swarm<Behaviour>,
-    to_client: mpsc::Sender<FromSwarm>,
-    from_client: mpsc::Receiver<ToSwarm>,
-    namespace: String,
-    known_peers: HashMap<PeerId, Vec<Multiaddr>>,
-}
-impl Peer {
-    pub fn new(
-        namespace: String,
-        kp: Keypair,
-        to_client: mpsc::Sender<FromSwarm>,
-        from_client: mpsc::Receiver<ToSwarm>,
-    ) -> Result<Self, ListenError> {
-        let mut swarm = libp2p::SwarmBuilder::with_existing_identity(kp)
-            .with_tokio()
-            .with_quic()
-            // TODO(evan) .with_bandwidth_metrics()
-            .with_behaviour(|kp| Behaviour::new(namespace.clone(), kp))
-            .expect("invalid swarm behaviour")
-            .build();
+/// Namespace for crate-wide extension traits/methods
+pub(crate) mod ext {
+    use extend::ext;
+    use libp2p::Multiaddr;
+    use libp2p::multiaddr::Protocol;
+    use std::net::IpAddr;

-        swarm
-            .listen_on("/ip6/::/udp/0/quic-v1".parse().expect("invalid multiaddr"))
-            .map_err(|_| ListenError)?;
-        swarm
-            .listen_on(
-                "/ip4/0.0.0.0/udp/0/quic-v1"
-                    .parse()
-                    .expect("invalid multiaddr"),
-            )
-            .map_err(|_| ListenError)?;
-        Ok(Self {
-            swarm,
-            to_client,
-            from_client,
-            namespace,
-            known_peers: HashMap::default(),
-        })
-    }
-    pub async fn run(&mut self) -> Result<(), ()> {
-        loop {
-            tokio::select! {
-                event = self.swarm.next() => self.handle_event(event.ok_or(())?).await?,
-                msg = self.from_client.recv() => self.handle_message(msg.ok_or(())?).await?,
-            }
-        }
-    }
-    async fn handle_message(&mut self, message: ToSwarm) -> Result<(), ()> {
-        match message {
-            ToSwarm::Message(topic, data) => {
-                if let Err(e) = self
-                    .swarm
-                    .behaviour_mut()
-                    .gossipsub
-                    .publish(TopicHash::from_raw(topic), data)
-                {
-                    self.to_client
-                        .send(FromSwarm::PublishError(e))
-                        .await
-                        .map_err(|_| ())?;
+    #[ext(pub, name = MultiaddrExt)]
+    impl Multiaddr {
+        /// If the multiaddress corresponds to a TCP address, extracts it
+        fn try_to_tcp_addr(&self) -> Option<(IpAddr, u16)> {
+            let mut ps = self.into_iter();
+            let ip = if let Some(p) = ps.next() {
+                match p {
+                    Protocol::Ip4(ip) => IpAddr::V4(ip),
+                    Protocol::Ip6(ip) => IpAddr::V6(ip),
+                    _ => return None,
                }
-            }
-            ToSwarm::Subscribe(topic) => {
-                match self
-                    .swarm
-                    .behaviour_mut()
-                    .gossipsub
-                    .subscribe(&gossipsub::IdentTopic::new(topic))
-                {
-                    Ok(_) => {}
-                    Err(gossipsub::SubscriptionError::NotAllowed) => {
-                        unreachable!("subscription filter hit")
-                    }
-                    Err(gossipsub::SubscriptionError::PublishError(e)) => self
-                        .to_client
-                        .send(FromSwarm::PublishError(e))
-                        .await
-                        .map_err(|_| ())?,
-                }
-            }
-            ToSwarm::Unsubscribe(topic) => {
-                self.swarm
-                    .behaviour_mut()
-                    .gossipsub
-                    .unsubscribe(&gossipsub::IdentTopic::new(topic));
-            }
-        }
-        Ok(())
-    }
-    async fn handle_event(&mut self, event: SwarmEvent<BehaviourEvent>) -> Result<(), ()> {
-        let SwarmEvent::Behaviour(event) = event else {
-            return Ok(());
-        };
-        match event {
-            BehaviourEvent::Gossipsub(gossipsub::Event::Message { message, .. }) => {
-                if let Some(source) = message.source {
-                    self.to_client
-                        .send(FromSwarm::Message(
-                            source,
-                            message.topic.into_string(),
-                            message.data,
-                        ))
-                        .await
-                        .map_err(|_| ())?;
-                }
-            }
-            BehaviourEvent::Identify(identify::Event::Received { peer_id, info, .. }) => {
-                log::debug!(
-                    "identify from {peer_id}: protocol_version='{}' agent_version='{}' (local namespace='{}')",
-                    info.protocol_version,
-                    info.agent_version,
-                    self.namespace
-                );
-                if info.protocol_version == self.namespace {
-                    self.passed_namespace(peer_id);
-                    self.to_client
-                        .send(FromSwarm::Discovered(peer_id))
-                        .await
-                        .map_err(|_| ())?;
-                } else {
-                    self.failed_namespace(peer_id);
-                }
-            }
-            BehaviourEvent::Mdns(mdns::Event::Discovered(v)) => {
-                for (peer_id, addr) in v {
-                    self.known_peers.entry(peer_id).or_default().push(addr);
-                }
-                for (peer_id, addrs) in &self.known_peers {
-                    // dialopts handles rate limiting, we should check errors if we want to blacklist earlier
-                    let _ = self
-                        .swarm
-                        .dial(DialOpts::peer_id(*peer_id).addresses(addrs.clone()).build());
-                }
-            }
-            BehaviourEvent::Mdns(mdns::Event::Expired(v)) => {
-                for (peer_id, addr) in v {
-                    let addrs = self.known_peers.entry(peer_id).or_default();
-                    addrs.retain(|a| *a != addr);
-                    if addrs.is_empty() {
-                        self.known_peers.remove(&peer_id);
-                        self.swarm
-                            .behaviour_mut()
-                            .gossipsub
-                            .remove_explicit_peer(&peer_id);
-                        self.to_client
-                            .send(FromSwarm::Expired(peer_id))
-                            .await
-                            .map_err(|_| ())?;
-                    }
-                }
-            }
-            _ => {}
-        }
-        Ok(())
-    }
-    fn passed_namespace(&mut self, peer_id: PeerId) {
-        self.swarm
-            .behaviour_mut()
-            .gossipsub
-            .remove_blacklisted_peer(&peer_id);
-        self.swarm
-            .behaviour_mut()
-            .gossipsub
-            .add_explicit_peer(&peer_id);
-    }
-    fn failed_namespace(&mut self, peer_id: PeerId) {
-        self.swarm
-            .behaviour_mut()
-            .gossipsub
-            .blacklist_peer(&peer_id);
-        self.swarm
-            .behaviour_mut()
-            .gossipsub
-            .remove_explicit_peer(&peer_id);
-    }
-}
-
-#[derive(NetworkBehaviour)]
-pub struct Behaviour {
-    gossipsub: gossipsub::Behaviour,
-    mdns: mdns::tokio::Behaviour,
-    identify: identify::Behaviour,
-}
-
-impl Behaviour {
-    fn new(namespace: String, kp: &Keypair) -> Self {
-        let mdns = mdns::Behaviour::new(mdns::Config::default(), kp.public().to_peer_id())
-            .expect("mdns behaviour failed to build");
-
-        let identify =
-            identify::Behaviour::new(identify::Config::new_with_signed_peer_record(namespace, kp));
-
-        let gossipsub = gossipsub::Behaviour::new(
-            gossipsub::MessageAuthenticity::Signed(kp.clone()),
-            gossipsub::ConfigBuilder::default()
-                .max_transmit_size(1024 * 1024)
-                .validation_mode(gossipsub::ValidationMode::Strict)
-                .build()
-                .expect("invalid gossipsub configuration"),
-        )
-        .expect("gossipsub behaviour failed ot build");
-
-        Self {
-            gossipsub,
-            mdns,
-            identify,
+            } else {
+                return None;
+            };
+            let Some(Protocol::Tcp(port)) = ps.next() else {
+                return None;
+            };
+            Some((ip, port))
        }
    }
 }

-// TODO: more tests
-#[cfg(test)]
-mod tests {
-    use super::*;
-    use tokio::time::{Duration, timeout};
+pub(crate) mod private {
+    #![allow(dead_code)]

-    fn make_peer(namespace: &str) -> (Peer, mpsc::Receiver<FromSwarm>, mpsc::Sender<ToSwarm>) {
-        let kp = Keypair::generate_ed25519();
-
-        let (to_client_tx, to_client_rx) = mpsc::channel(64);
-        let (to_peer_tx, to_peer_rx) = mpsc::channel(64);
-
-        let peer = Peer::new(namespace.to_string(), kp, to_client_tx, to_peer_rx)
-            .expect("Peer::new should succeed in tests");
-
-        (peer, to_client_rx, to_peer_tx)
-    }
-
-    async fn next_listen_addr(peer: &mut Peer) -> Multiaddr {
-        loop {
-            match peer.swarm.next().await {
-                Some(SwarmEvent::NewListenAddr { address, .. }) => return address,
-                Some(_) => {}
-                None => panic!("swarm stream ended unexpectedly"),
-            }
-        }
-    }
-
-    #[tokio::test]
-    async fn subscribe_and_unsubscribe_do_not_error() {
-        let (mut peer, mut events_rx, commands_tx) = make_peer("ns-test");
-
-        // Drive the swarm just enough to get at least one listen address event,
-        // so the background run loop has something initialized.
-        let _addr = next_listen_addr(&mut peer).await;
-
-        // Run the peer loop in the background.
-        let handle = tokio::spawn(async move {
-            let _ = peer.run().await;
-        });
-
-        commands_tx
-            .send(ToSwarm::Subscribe("topic-a".to_string()))
-            .await
-            .unwrap();
-
-        commands_tx
-            .send(ToSwarm::Unsubscribe("topic-a".to_string()))
-            .await
-            .unwrap();
-
-        // We don't *require* any FromSwarm events here; this is mainly a
-        // smoke test that the message-handling path doesn't panic/hang.
-        // Still, poll briefly to ensure the task is alive.
-        let _ = timeout(Duration::from_millis(200), events_rx.recv()).await;
-
-        // Shut down: dropping the command sender closes the channel, causing run() to return Err.
-        drop(commands_tx);
-        let _ = handle.await;
-    }
+    /// Sealed traits support
+    pub trait Sealed {}
+    impl<T: ?Sized> Sealed for T {}
 }
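As a rough illustration of what `try_to_tcp_addr` in the hunk above checks, here is a minimal Python sketch that works on the textual multiaddr form. It is not the Rust implementation, which iterates typed `Protocol` components. The string parsing and the choice of `ipaddress` types are assumptions made for the example. As in the Rust code, the first component must be `ip4` or `ip6`, the second must be `tcp`, and any trailing components are ignored.

```python
import ipaddress


def try_to_tcp_addr(multiaddr: str):
    """Return (ip, port) if the multiaddr starts with /ip4|ip6/<addr>/tcp/<port>, else None."""
    parts = multiaddr.split("/")
    # A textual multiaddr starts with "/", so the first split element is empty.
    if len(parts) < 5 or parts[0] != "":
        return None
    _, ip_proto, ip_text, transport, port_text = parts[:5]

    # First component: must be an IP address (v4 or v6).
    if ip_proto == "ip4":
        ip_cls = ipaddress.IPv4Address
    elif ip_proto == "ip6":
        ip_cls = ipaddress.IPv6Address
    else:
        return None
    try:
        ip = ip_cls(ip_text)
    except ValueError:
        return None

    # Second component: must be a TCP port that fits in 16 bits.
    if transport != "tcp" or not (port_text.isascii() and port_text.isdigit()):
        return None
    port = int(port_text)
    if port > 0xFFFF:
        return None
    return ip, port
```

Addresses that start with a non-IP protocol (for example `/dns4/...`) or whose second component is not `tcp` yield `None`, mirroring the early returns in the Rust version.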
--- a/rust/networking/src/swarm.rs
+++ b/rust/networking/src/swarm.rs
@@ -0,0 +1,143 @@
+use crate::alias;
+use crate::swarm::transport::tcp_transport;
+pub use behaviour::{Behaviour, BehaviourEvent};
+use libp2p::{SwarmBuilder, identity};
+
+pub type Swarm = libp2p::Swarm<Behaviour>;
+
+/// The current version of the network: this prevents devices running different versions of the
+/// software from interacting with each other.
+///
+/// TODO: right now this is a hardcoded constant; figure out what the versioning semantics should
+///       even be, and how to inject the right version into this config/initialization. E.g. should
+///       this be passed in as a parameter? What about rapidly changing versions in debug builds?
+///       this is all VERY very hard to figure out and needs to be mulled over as a team.
+pub const NETWORK_VERSION: &[u8] = b"v0.0.1";
+pub const OVERRIDE_VERSION_ENV_VAR: &str = "EXO_LIBP2P_NAMESPACE";
+
+/// Create and configure a swarm which listens on all IPv4 interfaces, on an OS-assigned port
+pub fn create_swarm(keypair: identity::Keypair) -> alias::AnyResult<Swarm> {
+    let mut swarm = SwarmBuilder::with_existing_identity(keypair)
+        .with_tokio()
+        .with_other_transport(tcp_transport)?
+        .with_behaviour(Behaviour::new)?
+        .build();
+
+    // Listen on all interfaces and whatever port the OS assigns
+    swarm.listen_on("/ip4/0.0.0.0/tcp/0".parse()?)?;
+    Ok(swarm)
+}
+
+mod transport {
+    use crate::alias;
+    use crate::swarm::{NETWORK_VERSION, OVERRIDE_VERSION_ENV_VAR};
+    use futures::{AsyncRead, AsyncWrite};
+    use keccak_const::Sha3_256;
+    use libp2p::core::muxing;
+    use libp2p::core::transport::Boxed;
+    use libp2p::pnet::{PnetError, PnetOutput};
+    use libp2p::{PeerId, Transport, identity, noise, pnet, yamux};
+    use std::{env, sync::LazyLock};
+
+    /// Key used for networking's private network; parametrized on the [`NETWORK_VERSION`].
+    /// See [`pnet_upgrade`] for more.
+    static PNET_PRESHARED_KEY: LazyLock<[u8; 32]> = LazyLock::new(|| {
+        let builder = Sha3_256::new().update(b"exo_discovery_network");
+
+        if let Ok(var) = env::var(OVERRIDE_VERSION_ENV_VAR) {
+            let bytes = var.into_bytes();
+            builder.update(&bytes)
+        } else {
+            builder.update(NETWORK_VERSION)
+        }
+        .finalize()
+    });
+
+    /// Make the Swarm run on a private network, so as not to clash with public libp2p nodes or
+    /// with different-versioned instances of this same network.
+    /// This is implemented as an additional "upgrade" on top of existing [`libp2p::Transport`] layers.
+    async fn pnet_upgrade<TSocket>(
+        socket: TSocket,
+        _: impl Sized,
+    ) -> Result<PnetOutput<TSocket>, PnetError>
+    where
+        TSocket: AsyncRead + AsyncWrite + Send + Unpin + 'static,
+    {
+        use pnet::{PnetConfig, PreSharedKey};
+        PnetConfig::new(PreSharedKey::new(*PNET_PRESHARED_KEY))
+            .handshake(socket)
+            .await
+    }
+
+    /// TCP/IP transport layer configuration.
+    pub fn tcp_transport(
+        keypair: &identity::Keypair,
+    ) -> alias::AnyResult<Boxed<(PeerId, muxing::StreamMuxerBox)>> {
+        use libp2p::{
+            core::upgrade::Version,
+            tcp::{Config, tokio},
+        };
+
+        // `TCP_NODELAY` enabled => avoid latency
+        let tcp_config = Config::default().nodelay(true);
+
+        // V1 + lazy flushing => 0-RTT negotiation
+        let upgrade_version = Version::V1Lazy;
+
+        // Noise is faster than TLS + we don't care much for security
+        let noise_config = noise::Config::new(keypair)?;
+
+        // Use default Yamux config for multiplexing
+        let yamux_config = yamux::Config::default();
+
+        // Create new Tokio-driven TCP/IP transport layer
+        let base_transport = tokio::Transport::new(tcp_config)
+            .and_then(pnet_upgrade)
+            .upgrade(upgrade_version)
+            .authenticate(noise_config)
+            .multiplex(yamux_config);
+
+        // Return boxed transport (to flatten complex type)
+        Ok(base_transport.boxed())
+    }
+}
+
+mod behaviour {
+    use crate::{alias, discovery};
+    use libp2p::swarm::NetworkBehaviour;
+    use libp2p::{gossipsub, identity};
+
+    /// Behavior of the Swarm which composes all desired behaviors:
+    /// Right now it's just [`discovery::Behaviour`] and [`gossipsub::Behaviour`].
+    #[derive(NetworkBehaviour)]
+    pub struct Behaviour {
+        pub discovery: discovery::Behaviour,
+        pub gossipsub: gossipsub::Behaviour,
+    }
+
+    impl Behaviour {
+        pub fn new(keypair: &identity::Keypair) -> alias::AnyResult<Self> {
+            Ok(Self {
+                discovery: discovery::Behaviour::new(keypair)?,
+                gossipsub: gossipsub_behaviour(keypair),
+            })
+        }
+    }
+
+    fn gossipsub_behaviour(keypair: &identity::Keypair) -> gossipsub::Behaviour {
+        use gossipsub::{ConfigBuilder, MessageAuthenticity, ValidationMode};
+
+        // build a gossipsub network behaviour
+        //  => signed message authenticity + strict validation mode means the message-ID is
+        //     automatically provided by gossipsub w/out needing to provide custom message-ID function
+        gossipsub::Behaviour::new(
+            MessageAuthenticity::Signed(keypair.clone()),
+            ConfigBuilder::default()
+                .max_transmit_size(1024 * 1024)
+                .validation_mode(ValidationMode::Strict)
+                .build()
+                .expect("the configuration should always be valid"),
+        )
+        .expect("creating gossipsub behavior should always work")
+    }
+}
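The pre-shared key logic in `swarm.rs` above decides which nodes can talk to each other: two nodes only complete the pnet handshake if they derive the same 32-byte key. The following Python sketch reproduces that derivation for illustration, assuming that `keccak_const::Sha3_256` is the standard SHA3-256 that `hashlib.sha3_256` implements. The function name `derive_preshared_key` and the `environ` parameter are hypothetical and exist only to make the example testable.

```python
import hashlib
import os

# Constants copied from the Rust source above.
NETWORK_VERSION = b"v0.0.1"
OVERRIDE_VERSION_ENV_VAR = "EXO_LIBP2P_NAMESPACE"


def derive_preshared_key(environ=None) -> bytes:
    """Hash a fixed prefix plus either the env override or the default version."""
    env = os.environ if environ is None else environ
    hasher = hashlib.sha3_256(b"exo_discovery_network")
    override = env.get(OVERRIDE_VERSION_ENV_VAR)
    if override is not None:
        # The override replaces the hardcoded version entirely.
        hasher.update(override.encode("utf-8"))
    else:
        hasher.update(NETWORK_VERSION)
    return hasher.digest()


default_key = derive_preshared_key({})
custom_key = derive_preshared_key({OVERRIDE_VERSION_ENV_VAR: "my-cluster"})
# Different namespaces give different keys, so those nodes cannot connect.
print(default_key != custom_key)
```

Under these assumptions, every node in a cluster needs the same `EXO_LIBP2P_NAMESPACE` value (or none at all), since a mismatch changes the key and the transport upgrade fails before any discovery or gossipsub traffic is exchanged.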
--- a/rust/networking/tests/dummy.rs
+++ b/rust/networking/tests/dummy.rs
@@ -0,0 +1,7 @@
+// maybe this will hold tests in the future...??
+
+#[cfg(test)]
+mod tests {
+    #[test]
+    fn does_nothing() {}
+}
--- a/rust/rust-toolchain.toml
+++ b/rust/rust-toolchain.toml
@@ -0,0 +1,2 @@
+[toolchain]
+channel = "nightly"
--- a/rust/util/Cargo.toml
+++ b/rust/util/Cargo.toml
@@ -0,0 +1,15 @@
+[package]
+name = "util"
+version = { workspace = true }
+edition = { workspace = true }
+publish = false
+
+[lib]
+doctest = false
+name = "util"
+path = "src/lib.rs"
+
+[lints]
+workspace = true
+
+[dependencies]
--- a/rust/util/src/lib.rs
+++ b/rust/util/src/lib.rs
@@ -0,0 +1 @@
+pub mod wakerdeque;
--- a/rust/util/src/wakerdeque.rs
+++ b/rust/util/src/wakerdeque.rs
@@ -0,0 +1,55 @@
+use std::collections::VecDeque;
+use std::fmt::{Debug, Formatter};
+use std::task::{Context, Waker};
+
+/// A wrapper around [`VecDeque`] which wakes (if it can) on any `push_*` methods,
+/// and updates the internally stored waker by consuming [`Context`] on any `pop_*` methods.
+pub struct WakerDeque<T> {
+    waker: Option<Waker>,
+    deque: VecDeque<T>,
+}
+
+impl<T: Debug> Debug for WakerDeque<T> {
+    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+        self.deque.fmt(f)
+    }
+}
+
+impl<T> WakerDeque<T> {
+    pub fn new() -> Self {
+        Self {
+            waker: None,
+            deque: VecDeque::new(),
+        }
+    }
+
+    fn update(&mut self, cx: &mut Context<'_>) {
+        self.waker = Some(cx.waker().clone());
+    }
+
+    fn wake(&mut self) {
+        if let Some(w) = self.waker.take() {
+            w.wake();
+        }
+    }
+
+    pub fn pop_front(&mut self, cx: &mut Context<'_>) -> Option<T> {
+        self.update(cx);
+        self.deque.pop_front()
+    }
+
+    pub fn pop_back(&mut self, cx: &mut Context<'_>) -> Option<T> {
+        self.update(cx);
+        self.deque.pop_back()
+    }
+
+    pub fn push_front(&mut self, value: T) {
+        self.wake();
+        self.deque.push_front(value);
+    }
+
+    pub fn push_back(&mut self, value: T) {
+        self.wake();
+        self.deque.push_back(value);
+    }
+}
--- a/src/exo/main.py
+++ b/src/exo/main.py
@@ -1,5 +1,4 @@
 import argparse
-import importlib.metadata
 import itertools
 import multiprocessing as mp
 import os
@@ -45,9 +44,9 @@ class Node:
    @classmethod
    async def create(cls, args: "Args") -> "Self":
        keypair = get_node_id_keypair()
-        node_id = NodeId(keypair.to_string())
+        node_id = NodeId(keypair.to_peer_id().to_base58())
        session_id = SessionId(master_node_id=node_id, election_clock=0)
-        router = Router.create(keypair, namespace=args.namespace)
+        router = Router.create(keypair)
        await router.register_topic(topics.GLOBAL_EVENTS)
        await router.register_topic(topics.LOCAL_EVENTS)
        await router.register_topic(topics.COMMANDS)
@@ -73,7 +72,7 @@ class Node:
        else:
            download_coordinator = None

-        if not args.no_api:
+        if args.spawn_api:
            api = API(
                node_id,
                session_id,
@@ -259,7 +258,7 @@ def main():
    # TODO: Refactor the current verbosity system
    logger_setup(EXO_LOG, args.verbosity)
    logger.info("Starting EXO")
-    logger.info(f"Namespace: {args.namespace}")
+    logger.info(f"EXO_LIBP2P_NAMESPACE: {os.getenv('EXO_LIBP2P_NAMESPACE')}")

    # Set FAST_SYNCH override env var for runner subprocesses
    if args.fast_synch is True:
@@ -276,13 +275,13 @@ def main():


 class Args(CamelCaseModel):
-    verbosity: int
-    force_master: bool
-    no_api: bool
-    api_port: PositiveInt
+    verbosity: int = 0
+    force_master: bool = False
+    spawn_api: bool = False
+    api_port: PositiveInt = 52415
+    tb_only: bool = False
    no_worker: bool = False
    no_downloads: bool = False
-    namespace: str
    fast_synch: bool | None = None  # None = auto, True = force on, False = force off

    @classmethod
@@ -312,15 +311,14 @@ class Args(CamelCaseModel):
        )
        parser.add_argument(
            "--no-api",
-            action="store_true",
-            help="Disable the API server for this node",
+            action="store_false",
+            dest="spawn_api",
        )
        parser.add_argument(
            "--api-port",
            type=int,
            dest="api_port",
            default=52415,
-            help="Which port the API server will be available on",
        )
        parser.add_argument(
            "--no-worker",
@@ -331,11 +329,6 @@ class Args(CamelCaseModel):
            action="store_true",
            help="Disable the download coordinator (node won't download models)",
        )
-        parser.add_argument(
-            "--namespace",
-            default=importlib.metadata.version("exo"),
-            help="Set the EXO namespace to run multiple isolated clusters",
-        )
        fast_synch_group = parser.add_mutually_exclusive_group()
        fast_synch_group.add_argument(
            "--fast-synch",
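The `--no-api` change above inverts the flag: instead of storing `no_api=True`, argparse now stores `False` into `spawn_api` when the flag is present. With `action="store_false"`, argparse's default for the destination is `True`, so the API is still spawned unless `--no-api` is passed. The sketch below checks this in isolation; `build_parser` and the `prog` name are hypothetical, and only the two arguments shown in the hunk are reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the parser built in Args; only two flags shown."""
    parser = argparse.ArgumentParser(prog="exo-sketch")
    # store_false: absent flag -> spawn_api is True, present flag -> False.
    parser.add_argument("--no-api", action="store_false", dest="spawn_api")
    parser.add_argument("--api-port", type=int, dest="api_port", default=52415)
    return parser


args_default = build_parser().parse_args([])
args_disabled = build_parser().parse_args(["--no-api"])
print(args_default.spawn_api, args_disabled.spawn_api)
```

Note that the model-level default `spawn_api: bool = False` in `Args` differs from the parser's implicit default of `True`. Which one applies presumably depends on whether `Args` is built from parsed CLI arguments or constructed directly, which this hunk does not show.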
--- a/src/exo/master/api.py
+++ b/src/exo/master/api.py
@@ -73,6 +73,8 @@ from exo.shared.types.api import (
    CreateInstanceResponse,
    DeleteDownloadResponse,
    DeleteInstanceResponse,
+    DistributeModelParams,
+    DistributeModelResponse,
    ErrorInfo,
    ErrorResponse,
    FinishReason,
@@ -117,6 +119,7 @@ from exo.shared.types.commands import (
    CreateInstance,
    DeleteDownload,
    DeleteInstance,
+    DistributeModel,
    DownloadCommand,
    ForwarderCommand,
    ForwarderDownloadCommand,
@@ -142,6 +145,7 @@ from exo.shared.types.openai_responses import (
    ResponsesResponse,
 )
 from exo.shared.types.state import State
+from exo.shared.types.worker.downloads import DownloadCompleted
 from exo.shared.types.worker.instances import Instance, InstanceId, InstanceMeta
 from exo.shared.types.worker.shards import Sharding
 from exo.utils.banner import print_startup_banner
@@ -298,6 +302,7 @@ class API:
        self.app.get("/events")(self.stream_events)
        self.app.post("/download/start")(self.start_download)
        self.app.delete("/download/{node_id}/{model_id:path}")(self.delete_download)
+        self.app.post("/v1/models/{model_id:path}/distribute")(self.distribute_model)
        self.app.get("/v1/traces")(self.list_traces)
        self.app.get("/v1/traces/{task_id}")(self.get_trace)
        self.app.get("/v1/traces/{task_id}/stats")(self.get_trace_stats)
@@ -1477,6 +1482,57 @@ class API:
        await self._send_download(command)
        return DeleteDownloadResponse(command_id=command.command_id)

+    async def distribute_model(
+        self, model_id: ModelId, payload: DistributeModelParams
+    ) -> DistributeModelResponse:
+        """Distribute model files from one node to others via MLX distributed."""
+        # Find a source node that has the model downloaded
+        source_node_id: NodeId | None = None
+        for nid, downloads in self.state.downloads.items():
+            for dp in downloads:
+                if (
+                    isinstance(dp, DownloadCompleted)
+                    and dp.shard_metadata.model_card.model_id == model_id
+                ):
+                    source_node_id = nid
+                    break
+            if source_node_id is not None:
+                break
+
+        if source_node_id is None:
+            raise HTTPException(
+                status_code=404,
+                detail=f"No node has model {model_id} downloaded",
+            )
+
+        # Determine target nodes
+        if payload.target_node_ids is not None:
+            target_node_ids = [
+                nid for nid in payload.target_node_ids if nid != source_node_id
+            ]
+        else:
+            target_node_ids = [
+                nid for nid in self.state.topology.list_nodes() if nid != source_node_id
+            ]
+
+        if not target_node_ids:
+            raise HTTPException(
+                status_code=400,
+                detail="No target nodes to distribute to",
+            )
+
+        command = DistributeModel(
+            model_id=model_id,
+            source_node_id=source_node_id,
+            target_node_ids=target_node_ids,
+        )
+        await self._send(command)
+
+        return DistributeModelResponse(
+            command_id=command.command_id,
+            message=f"Distributing {model_id} from {source_node_id} to {len(target_node_ids)} node(s)",
+        )
+
    def _get_trace_path(self, task_id: str) -> Path:
        return EXO_TRACING_CACHE_DIR / f"trace_{task_id}.json"

--- a/src/exo/master/main.py
+++ b/src/exo/master/main.py
@@ -17,6 +17,7 @@ from exo.shared.constants import EXO_EVENT_LOG_DIR, EXO_TRACING_ENABLED
 from exo.shared.types.commands import (
    CreateInstance,
    DeleteInstance,
+    DistributeModel,
    ForwarderCommand,
    ForwarderDownloadCommand,
    ImageEdits,
@@ -312,6 +313,37 @@ class Master:
                                self.state.instances, placement
                            )
                            generated_events.extend(transition_events)
+                        case DistributeModel():
+                            from exo.shared.models.model_cards import ModelCard
+                            from exo.shared.types.worker.instances import InstanceMeta
+                            from exo.shared.types.worker.shards import Sharding
+
+                            model_card = await ModelCard.load(command.model_id)
+                            all_node_ids = set(
+                                [command.source_node_id] + list(command.target_node_ids)
+                            )
+                            place_command = PlaceInstance(
+                                model_card=model_card,
+                                sharding=Sharding.Pipeline,
+                                instance_meta=InstanceMeta.MlxRing,
+                                min_nodes=len(all_node_ids),
+                            )
+                            placement = place_instance(
+                                place_command,
+                                self.state.topology,
+                                self.state.instances,
+                                self.state.node_memory,
+                                self.state.node_network,
+                                required_nodes=all_node_ids,
+                            )
+                            # Mark new instances as transfer-only
+                            for instance_id, instance in placement.items():
+                                if instance_id not in self.state.instances:
+                                    instance.shard_assignments.transfer_only = True
+                            transition_events = get_transition_events(
+                                self.state.instances, placement
+                            )
+                            generated_events.extend(transition_events)
                        case SendInputChunk(chunk=chunk):
                            generated_events.append(
                                InputChunkReceived(
@@ -383,7 +415,7 @@ class Master:
                        await self._handle_traces_collected(event)
                        continue

-                    logger.trace(f"Master indexing event: {str(event)[:100]}")
+                    logger.debug(f"Master indexing event: {str(event)[:100]}")
                    indexed = IndexedEvent(event=event, idx=len(self._event_log))
                    self.state = apply(self.state, indexed)

--- a/src/exo/master/tests/test_master.py
+++ b/src/exo/master/tests/test_master.py
@@ -42,7 +42,7 @@ from exo.utils.channels import channel
@pytest.mark.asyncio
 async def test_master():
    keypair = get_node_id_keypair()
-    node_id = NodeId(keypair.to_string())
+    node_id = NodeId(keypair.to_peer_id().to_base58())
    session_id = SessionId(master_node_id=node_id, election_clock=0)

    ge_sender, global_event_receiver = channel[ForwarderEvent]()
@@ -75,7 +75,7 @@ async def test_master():
    async with anyio.create_task_group() as tg:
        tg.start_soon(master.run)

-        sender_node_id = NodeId(f"{keypair.to_string()}_sender")
+        sender_node_id = NodeId(f"{keypair.to_peer_id().to_base58()}_sender")
        # inject a NodeGatheredInfo event
        logger.info("inject a NodeGatheredInfo event")
        await local_event_sender.send(
--- a/src/exo/routing/connection_message.py
+++ b/src/exo/routing/connection_message.py
@@ -1,9 +1,37 @@
+from enum import Enum
+
+from exo_pyo3_bindings import ConnectionUpdate, ConnectionUpdateType
+
 from exo.shared.types.common import NodeId
 from exo.utils.pydantic_ext import CamelCaseModel

 """Serialisable types for Connection Updates/Messages"""


+class ConnectionMessageType(Enum):
+    Connected = 0
+    Disconnected = 1
+
+    @staticmethod
+    def from_update_type(update_type: ConnectionUpdateType) -> "ConnectionMessageType":
+        match update_type:
+            case ConnectionUpdateType.Connected:
+                return ConnectionMessageType.Connected
+            case ConnectionUpdateType.Disconnected:
+                return ConnectionMessageType.Disconnected
+
+
 class ConnectionMessage(CamelCaseModel):
    node_id: NodeId
-    expired: bool
+    connection_type: ConnectionMessageType
+    remote_ipv4: str
+    remote_tcp_port: int
+
+    @classmethod
+    def from_update(cls, update: ConnectionUpdate) -> "ConnectionMessage":
+        return cls(
+            node_id=NodeId(update.peer_id.to_base58()),
+            connection_type=ConnectionMessageType.from_update_type(update.update_type),
+            remote_ipv4=update.remote_ipv4,
+            remote_tcp_port=update.remote_tcp_port,
+        )
--- a/src/exo/routing/router.py
+++ b/src/exo/routing/router.py
@@ -1,5 +1,5 @@
 from copy import copy
-from dataclasses import dataclass, field
+from itertools import count
 from math import inf
 from os import PathLike
 from pathlib import Path
@@ -14,14 +14,15 @@ from anyio import (
 )
 from anyio.abc import TaskGroup
 from exo_pyo3_bindings import (
+    AllQueuesFullError,
    Keypair,
-    PyPeer,
+    NetworkingHandle,
+    NoPeersSubscribedToTopicError,
 )
 from filelock import FileLock
 from loguru import logger

 from exo.shared.constants import EXO_NODE_ID_KEYPAIR
-from exo.shared.types.common import NodeId
 from exo.utils.channels import Receiver, Sender, channel
 from exo.utils.pydantic_ext import CamelCaseModel

@@ -98,32 +99,28 @@ class TopicRouter[T: CamelCaseModel]:
        )


-@dataclass
 class Router:
-    _peer: PyPeer
-    topic_routers: dict[str, TopicRouter[CamelCaseModel]] = field(
-        init=False, default_factory=dict
-    )
-    networking_receiver: Receiver[tuple[str, bytes]] = field(init=False)
-    _tmp_networking_sender: Sender[tuple[str, bytes]] | None = field(init=False)
-    _tg: TaskGroup | None = None
-
-    def __post_init__(self):
-        self._tmp_networking_sender, self.networking_receiver = channel()
-
    @classmethod
-    def create(cls, identity: Keypair, namespace: str) -> "Router":
-        return cls(_peer=PyPeer.new(identity, namespace))
+    def create(cls, identity: Keypair) -> "Router":
+        return cls(handle=NetworkingHandle(identity))
+
+    def __init__(self, handle: NetworkingHandle):
+        self.topic_routers: dict[str, TopicRouter[CamelCaseModel]] = {}
+        send, recv = channel[tuple[str, bytes]]()
+        self.networking_receiver: Receiver[tuple[str, bytes]] = recv
+        self._net: NetworkingHandle = handle
+        self._tmp_networking_sender: Sender[tuple[str, bytes]] | None = send
+        self._id_count = count()
+        self._tg: TaskGroup | None = None

    async def register_topic[T: CamelCaseModel](self, topic: TypedTopic[T]):
+        assert self._tg is None, "Attempted to register topic after setup time"
        send = self._tmp_networking_sender
        if send:
            self._tmp_networking_sender = None
        else:
            send = self.networking_receiver.clone_sender()
        router = TopicRouter[T](topic, send)
-        if self._tg is not None:
-            self._tg.start_soon(router.run)
        self.topic_routers[topic.topic] = cast(TopicRouter[CamelCaseModel], router)
        await self._networking_subscribe(str(topic.topic))

@@ -151,18 +148,14 @@ class Router:
    async def run(self):
        logger.debug("Starting Router")
        try:
-
-            async def _peer_run():
-                await self._peer.run()
-
            async with create_task_group() as tg:
                self._tg = tg
                for topic in self.topic_routers:
                    router = self.topic_routers[topic]
                    tg.start_soon(router.run)
                tg.start_soon(self._networking_recv)
+                tg.start_soon(self._networking_recv_connection_messages)
                tg.start_soon(self._networking_publish)
-                tg.start_soon(_peer_run)
                # Router only shuts down if you cancel it.
                await sleep_forever()
        finally:
@@ -177,58 +170,47 @@ class Router:
        self._tg.cancel_scope.cancel()

    async def _networking_subscribe(self, topic: str):
-        await self._peer.subscribe(topic)
+        await self._net.gossipsub_subscribe(topic)
        logger.info(f"Subscribed to {topic}")

    async def _networking_unsubscribe(self, topic: str):
-        await self._peer.unsubscribe(topic)
+        await self._net.gossipsub_unsubscribe(topic)
        logger.info(f"Unsubscribed from {topic}")

    async def _networking_recv(self):
        while True:
-            try:
-                swarm_event = await self._peer.recv()
-            except ValueError:
-                logger.error("Message too large for gossipsub, dropped")
-                continue
-            except ConnectionError:
-                logger.error("All peer queues full, network overloaded")
-                continue
-            except RuntimeError:
-                break
-
-            cm = None
-            if (peer_id := swarm_event.downcast_discovered()) is not None:
-                cm = ConnectionMessage(node_id=NodeId(peer_id), expired=False)
-            if (peer_id := swarm_event.downcast_expired()) is not None:
-                cm = ConnectionMessage(node_id=NodeId(peer_id), expired=True)
-
-            if cm is not None:
-                if CONNECTION_MESSAGES.topic in self.topic_routers:
-                    router = self.topic_routers[CONNECTION_MESSAGES.topic]
-                    assert router.topic.model_type == ConnectionMessage
-                    router = cast(TopicRouter[ConnectionMessage], router)
-                    await router.publish(cm)
-                continue
-
-            assert (msg := swarm_event.downcast_message()) is not None
-            _origin, topic, payload = msg
-            logger.debug(f"Received message on {topic} with payload {payload}")
+            topic, data = await self._net.gossipsub_recv()
+            logger.trace(f"Received message on {topic} with payload {data}")
            if topic not in self.topic_routers:
                logger.warning(f"Received message on unknown or inactive topic {topic}")
                continue

            router = self.topic_routers[topic]
-            await router.publish_bytes(payload)
+            await router.publish_bytes(data)
+
+    async def _networking_recv_connection_messages(self):
+        while True:
+            update = await self._net.connection_update_recv()
+            message = ConnectionMessage.from_update(update)
+            logger.trace(
+                f"Received message on connection_messages with payload {message}"
+            )
+            if CONNECTION_MESSAGES.topic in self.topic_routers:
+                router = self.topic_routers[CONNECTION_MESSAGES.topic]
+                assert router.topic.model_type == ConnectionMessage
+                router = cast(TopicRouter[ConnectionMessage], router)
+                await router.publish(message)

    async def _networking_publish(self):
        with self.networking_receiver as networked_items:
            async for topic, data in networked_items:
                try:
                    logger.trace(f"Sending message on {topic} with payload {data}")
-                    await self._peer.send(topic, data)
-                except RuntimeError:
-                    break
+                    await self._net.gossipsub_publish(topic, data)
+                except NoPeersSubscribedToTopicError:
+                    pass
+                except AllQueuesFullError:
+                    logger.warning(f"All peer queues full, dropping message on {topic}")


 def get_node_id_keypair(
@@ -239,7 +221,7 @@ def get_node_id_keypair(
    Obtain the :class:`PeerId` by from it.
    """
    # TODO(evan): bring back node id persistence once we figure out how to deal with duplicates
-    return Keypair.generate()
+    return Keypair.generate_ed25519()

    def lock_path(path: str | bytes | PathLike[str] | PathLike[bytes]) -> Path:
        return Path(str(path) + ".lock")
--- a/src/exo/shared/tests/test_election.py
+++ b/src/exo/shared/tests/test_election.py
@@ -1,7 +1,7 @@
 import pytest
 from anyio import create_task_group, fail_after, move_on_after

-from exo.routing.connection_message import ConnectionMessage
+from exo.routing.connection_message import ConnectionMessage, ConnectionMessageType
 from exo.shared.election import Election, ElectionMessage, ElectionResult
 from exo.shared.types.commands import ForwarderCommand, TestCommand
 from exo.shared.types.common import NodeId, SessionId
@@ -330,7 +330,9 @@ async def test_connection_message_triggers_new_round_broadcast() -> None:
            await cm_tx.send(
                ConnectionMessage(
                    node_id=NodeId(),
-                    expired=False,
+                    connection_type=ConnectionMessageType.Connected,
+                    remote_ipv4="",
+                    remote_tcp_port=0,
                )
            )

--- a/src/exo/shared/types/api.py
+++ b/src/exo/shared/types/api.py
@@ -373,6 +373,15 @@ class DeleteDownloadResponse(CamelCaseModel):
    command_id: CommandId


+class DistributeModelParams(CamelCaseModel):
+    target_node_ids: list[NodeId] | None = None  # None = all connected nodes
+
+
+class DistributeModelResponse(CamelCaseModel):
+    command_id: CommandId
+    message: str
+
+
 class TraceEventResponse(CamelCaseModel):
    name: str
    start_us: int
--- a/src/exo/shared/types/commands.py
+++ b/src/exo/shared/types/commands.py
@@ -77,6 +77,14 @@ class CancelDownload(BaseCommand):
    model_id: ModelId


+class DistributeModel(BaseCommand):
+    """Distribute model files from one node to others via MLX distributed."""
+
+    model_id: ModelId
+    source_node_id: NodeId
+    target_node_ids: list[NodeId]
+
+
 DownloadCommand = StartDownload | DeleteDownload | CancelDownload


@@ -91,6 +99,7 @@ Command = (
    | DeleteInstance
    | TaskFinished
    | SendInputChunk
+    | DistributeModel
 )


--- a/src/exo/shared/types/tasks.py
+++ b/src/exo/shared/types/tasks.py
@@ -41,7 +41,7 @@ class DownloadModel(BaseTask):  # emitted by Worker


 class LoadModel(BaseTask):  # emitted by Worker
-    pass
+    has_local_model: bool = Field(default=True)


 class ConnectToGroup(BaseTask):  # emitted by Worker
@@ -76,6 +76,13 @@ class ImageEdits(BaseTask):  # emitted by Master
    error_message: str | None = Field(default=None)


+class TransferModelToDisk(BaseTask):  # emitted by Worker
+    """Transfer all model files from source to receivers' disk via MLX distributed."""
+
+    shard_metadata: ShardMetadata
+    has_local_model: bool = Field(default=True)
+
+
 class Shutdown(BaseTask):  # emitted by Worker
    runner_id: RunnerId

@@ -85,6 +92,7 @@ Task = (
    | DownloadModel
    | ConnectToGroup
    | LoadModel
+    | TransferModelToDisk
    | StartWarmup
    | TextGeneration
    | ImageGeneration
--- a/src/exo/shared/types/worker/runners.py
+++ b/src/exo/shared/types/worker/runners.py
@@ -84,6 +84,7 @@ class ShardAssignments(CamelCaseModel):
    model_id: ModelId
    runner_to_shard: Mapping[RunnerId, ShardMetadata]
    node_to_runner: Mapping[NodeId, RunnerId]
+    transfer_only: bool = False

    @model_validator(mode="after")
    def validate_runners_exist(self) -> "ShardAssignments":
--- a/src/exo/worker/engines/mlx/auto_parallel.py
+++ b/src/exo/worker/engines/mlx/auto_parallel.py
@@ -44,6 +44,7 @@ if TYPE_CHECKING:
    from mlx_lm.models.cache import Cache

 TimeoutCallback = Callable[[], None]
+WeightLoader = Callable[[nn.Module, int], None] | None


 def eval_with_timeout(
@@ -330,6 +331,7 @@ def tensor_auto_parallel(
    group: mx.distributed.Group,
    timeout_seconds: float = 60.0,
    on_timeout: TimeoutCallback | None = None,
+    weight_loader: WeightLoader = None,
 ) -> nn.Module:
    all_to_sharded_linear = partial(
        shard_linear,
@@ -431,7 +433,7 @@ def tensor_auto_parallel(
        raise ValueError(f"Unsupported model type: {type(model)}")

    model = tensor_parallel_sharding_strategy.shard_model(
-        model, timeout_seconds, on_timeout
+        model, timeout_seconds, on_timeout, weight_loader
    )
    return patch_tensor_model(model)

@@ -458,6 +460,7 @@ class TensorParallelShardingStrategy(ABC):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module: ...


@@ -467,9 +470,12 @@ class LlamaShardingStrategy(TensorParallelShardingStrategy):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module:
        model = cast(LlamaModel, model)
-        for layer in model.layers:
+        for i, layer in enumerate(model.layers):
+            if weight_loader is not None:
+                weight_loader(model, i)
            # Force load weights before sharding to avoid FAST_SYNCH deadlock
            eval_with_timeout(
                layer.parameters(), timeout_seconds / len(model.layers), on_timeout
@@ -521,9 +527,12 @@ class DeepSeekShardingStrategy(TensorParallelShardingStrategy):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module:
        model = cast(DeepseekV3Model, model)
-        for layer in model.layers:
+        for i, layer in enumerate(model.layers):
+            if weight_loader is not None:
+                weight_loader(model, i)
            eval_with_timeout(
                layer.parameters(), timeout_seconds / len(model.layers), on_timeout
            )
@@ -596,9 +605,12 @@ class GLM4MoeLiteShardingStrategy(TensorParallelShardingStrategy):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module:
        model = cast(GLM4MoeLiteModel, model)
-        for layer in model.layers:  # type: ignore
+        for i, layer in enumerate(model.layers):  # type: ignore
+            if weight_loader is not None:
+                weight_loader(model, i)
            layer = cast(Glm4MoeLiteDecoderLayer, layer)
            eval_with_timeout(
                layer.parameters(),
@@ -738,9 +750,12 @@ class MiniMaxShardingStrategy(TensorParallelShardingStrategy):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module:
        model = cast(MiniMaxModel, model)
-        for layer in model.layers:
+        for i, layer in enumerate(model.layers):
+            if weight_loader is not None:
+                weight_loader(model, i)
            eval_with_timeout(
                layer.parameters(), timeout_seconds / len(model.layers), on_timeout
            )
@@ -778,9 +793,12 @@ class QwenShardingStrategy(TensorParallelShardingStrategy):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module:
        model = cast(Qwen3MoeModel | Qwen3NextModel, model)
-        for layer in model.layers:
+        for i, layer in enumerate(model.layers):
+            if weight_loader is not None:
+                weight_loader(model, i)
            eval_with_timeout(
                layer.parameters(), timeout_seconds / len(model.layers), on_timeout
            )
@@ -902,9 +920,12 @@ class Glm4MoeShardingStrategy(TensorParallelShardingStrategy):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module:
        model = cast(Glm4MoeModel, model)
-        for layer in model.layers:
+        for i, layer in enumerate(model.layers):
+            if weight_loader is not None:
+                weight_loader(model, i)
            eval_with_timeout(
                layer.parameters(), timeout_seconds / len(model.layers), on_timeout
            )
@@ -948,10 +969,13 @@ class GptOssShardingStrategy(TensorParallelShardingStrategy):
        model: nn.Module,
        timeout_seconds: float,
        on_timeout: TimeoutCallback | None,
+        weight_loader: WeightLoader = None,
    ) -> nn.Module:
        model = cast(GptOssMoeModel, model)

-        for layer in model.layers:
+        for i, layer in enumerate(model.layers):
+            if weight_loader is not None:
+                weight_loader(model, i)
            eval_with_timeout(
                layer.parameters(), timeout_seconds / len(model.layers), on_timeout
            )
--- a/src/exo/worker/engines/mlx/model_transfer.py
+++ b/src/exo/worker/engines/mlx/model_transfer.py
@@ -0,0 +1,499 @@
+"""
+Model transfer via MLX distributed all_sum.
+
+Three transfer modes:
+1. Metadata file transfer: broadcast small files (config.json, tokenizer, etc.) to disk
+2. Weight tensor broadcast: stream weight tensors directly into memory via all_sum
+3. Full file transfer: broadcast all files (including safetensors) to disk
+
+All functions are collective operations — every rank in the group must call them.
+
+Protocol relies on all_sum: source has real data, receivers have zeros.
+all_sum(source + zeros) = source data on all ranks.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import re
+import shutil
+import tempfile
+from functools import partial
+from pathlib import Path
+from typing import Any, Final, cast
+
+import mlx.core as mx
+
+from exo.shared.constants import EXO_MODELS_DIR
+from exo.shared.models.model_cards import ModelId
+from exo.worker.runner.bootstrap import logger
+
+Group = mx.distributed.Group
+
+CHUNK_SIZE: Final[int] = 100 * 1024 * 1024  # 100 MB
+_LAYER_RE: Final[re.Pattern[str]] = re.compile(r"(?:^|\.)(layers|h)\.(\d+)\.")
+
+
+def _all_sum_cpu(x: mx.array, group: Group) -> mx.array:
+    """all_sum on CPU stream to avoid GPU memory pressure."""
+    return mx.distributed.all_sum(
+        x, stream=mx.default_stream(mx.Device(mx.cpu)), group=group
+    )
+
+
+def _is_metadata_file(filename: str) -> bool:
+    """A metadata file is anything that isn't a weight file (.safetensors)."""
+    return not filename.endswith(".safetensors")
+
+
+def model_path_for_id(model_id: ModelId) -> Path:
+    """Get model path without requiring directory to exist (unlike build_model_path)."""
+    return EXO_MODELS_DIR / model_id.normalize()
+
+
+def coordinate_transfer(group: Group, has_local_model: bool) -> tuple[bool, int]:
+    """
+    Determine if a transfer is needed and which rank is the source.
+
+    All ranks must call this function (uses collective all_sum).
+
+    Returns:
+        (needs_transfer, source_rank): source_rank is the lowest rank that has the
+        model (0 if all ranks have it). needs_transfer is True if any rank lacks it.
+    """
+    all_sum = partial(_all_sum_cpu, group=group)
+    world_size = group.size()
+
+    # Each rank contributes a vector with a 1 at its own index if it has the model
+    bitmask = mx.zeros(world_size, dtype=mx.int32)
+    if has_local_model:
+        bitmask = bitmask.at[group.rank()].add(1)
+    summed = all_sum(bitmask)
+    mx.eval(summed)
+
+    has_model_flags: list[int] = summed.tolist()  # type: ignore[assignment]
+    total_have = sum(has_model_flags)
+
+    if total_have == 0:
+        raise RuntimeError(
+            "No rank has the model files — cannot transfer. "
+            "At least one node must have downloaded the model."
+        )
+
+    if total_have == world_size:
+        logger.info("All ranks have model files, no transfer needed")
+        return False, 0
+
+    source_rank = next(i for i, flag in enumerate(has_model_flags) if flag > 0)
+    logger.info(
+        f"Transfer needed: source_rank={source_rank}, "
+        f"{total_have}/{world_size} ranks have model"
+    )
+    return True, source_rank
+
+
+def _broadcast_json(obj: object, group: Group, is_source: bool) -> object:
+    """Broadcast a JSON-serializable object from source to all ranks."""
+    all_sum = partial(_all_sum_cpu, group=group)
+
+    data = json.dumps(obj, separators=(",", ":")).encode("utf-8") if is_source else b""
+
+    # Broadcast length
+    len_arr = mx.array([len(data) if is_source else 0], dtype=mx.int64)
+    len_result = all_sum(len_arr)
+    mx.eval(len_result)
+    length = int(len_result.item())
+    if length == 0:
+        return None
+
+    # Broadcast payload
+    if is_source:
+        arr = mx.array(list(data), dtype=mx.uint8)
+    else:
+        arr = mx.zeros(length, dtype=mx.uint8)
+    result = all_sum(arr)
+    mx.eval(result)
+    return json.loads(bytes(cast(list[int], result.tolist())))  # pyright: ignore[reportAny]
+
+
+def _build_manifest(
+    model_path: Path, metadata_only: bool = False
+) -> list[dict[str, str | int]]:
+    """Build a list of files in the model directory with their relative paths and sizes."""
+    manifest: list[dict[str, str | int]] = []
+    for root, _dirs, files in os.walk(model_path):
+        for fname in sorted(files):
+            if metadata_only and not _is_metadata_file(fname):
+                continue
+            full_path = Path(root) / fname
+            rel_path = str(full_path.relative_to(model_path))
+            manifest.append(
+                {
+                    "path": rel_path,
+                    "size": full_path.stat().st_size,
+                }
+            )
+    return manifest
+
+
+def _transfer_file_to_disk(
+    source_path: Path,
+    rel_path: str,
+    file_size: int,
+    group: Group,
+    is_source: bool,
+    dest_path: Path,
+) -> None:
+    """Transfer a single file chunk-by-chunk via all_sum. Source reads from disk, receivers write to dest_path."""
+    all_sum = partial(_all_sum_cpu, group=group)
+
+    if is_source:
+        src_file = source_path / rel_path
+        with open(src_file, "rb") as f:
+            offset = 0
+            while offset < file_size:
+                chunk_bytes = min(CHUNK_SIZE, file_size - offset)
+                data = f.read(chunk_bytes)
+                if not data:
+                    break
+                size_arr = mx.array([len(data)], dtype=mx.int64)
+                mx.eval(all_sum(size_arr))
+                chunk_arr = mx.array(list(data), dtype=mx.uint8)
+                result = all_sum(chunk_arr)
+                mx.eval(result)
+                offset += len(data)
+            # Signal end of file
+            mx.eval(all_sum(mx.array([0], dtype=mx.int64)))
+    else:
+        dst_file = dest_path / rel_path
+        os.makedirs(dst_file.parent, exist_ok=True)
+        with open(dst_file, "wb") as f:
+            while True:
+                size_arr = all_sum(mx.zeros(1, dtype=mx.int64))
+                mx.eval(size_arr)
+                chunk_size = int(size_arr.item())
+                if chunk_size == 0:
+                    break
+                chunk_data = all_sum(mx.zeros(chunk_size, dtype=mx.uint8))
+                mx.eval(chunk_data)
+                f.write(bytes(cast(list[int], chunk_data.tolist())))
+
+
+def _transfer_files_to_disk(
+    model_path: Path,
+    group: Group,
+    is_source: bool,
+    metadata_only: bool = False,
+) -> None:
+    """
+    Transfer files from source to all receivers' disk.
+
+    Source broadcasts a manifest, then each file. Receivers write to a temp dir,
+    then move each file into model_path with os.replace (atomic per file).
+    """
+    if is_source:
+        source_manifest = _build_manifest(model_path, metadata_only=metadata_only)
+    else:
+        source_manifest = []
+    manifest = cast(
+        list[dict[str, str | int]],
+        _broadcast_json(source_manifest if is_source else None, group, is_source),
+    )
+
+    if not manifest:
+        logger.info("No files to transfer")
+        return
+
+    logger.info(
+        f"Transferring {len(manifest)} files ({'metadata only' if metadata_only else 'all'})"
+    )
+
+    temp_dir: Path | None = None
+    if not is_source:
+        os.makedirs(model_path.parent, exist_ok=True)
+        temp_dir = Path(
+            tempfile.mkdtemp(
+                dir=model_path.parent,
+                prefix=f".transfer_{model_path.name}_",
+            )
+        )
+
+    try:
+        for entry in manifest:
+            rel_path = str(entry["path"])
+            file_size = int(entry["size"])
+            logger.info(f"  {rel_path} ({file_size} bytes)")
+            _transfer_file_to_disk(
+                source_path=model_path,
+                rel_path=rel_path,
+                file_size=file_size,
+                group=group,
+                is_source=is_source,
+                dest_path=temp_dir if temp_dir is not None else model_path,
+            )
+
+        if temp_dir is not None:
+            os.makedirs(model_path, exist_ok=True)
+            for entry in manifest:
+                rel_path = str(entry["path"])
+                src = temp_dir / rel_path
+                dst = model_path / rel_path
+                os.makedirs(dst.parent, exist_ok=True)
+                os.replace(src, dst)
+            logger.info(
+                f"Transfer complete: {len(manifest)} files moved to {model_path}"
+            )
+    finally:
+        if temp_dir is not None and temp_dir.exists():
+            shutil.rmtree(temp_dir, ignore_errors=True)
+
+
+def transfer_metadata_files(model_path: Path, group: Group, is_source: bool) -> None:
+    """
+    Transfer metadata files (config.json, tokenizer files, etc.) to receivers' disk.
+
+    All ranks must call this function (collective operation).
+    Only the designated source (is_source=True) should send; all others receive.
+    """
+    _transfer_files_to_disk(model_path, group, is_source=is_source, metadata_only=True)
+
+
+def transfer_all_files(model_path: Path, group: Group, is_source: bool) -> None:
+    """
+    Transfer ALL model files (including safetensors) to receivers' disk.
+
+    All ranks must call this function (collective operation).
+    Only the designated source (is_source=True) should send; all others receive.
+    """
+    _transfer_files_to_disk(model_path, group, is_source=is_source, metadata_only=False)
+
+
+def _parse_mx_dtype(dtype_str: str) -> mx.Dtype:
+    """Convert a dtype string like 'float16' or 'mlx.core.float16' to mx.Dtype."""
+    name = dtype_str.split(".")[-1]
+    dtype = getattr(mx, name, None)
+    if dtype is None:
+        raise ValueError(f"Unknown MLX dtype: {dtype_str}")
+    return dtype  # type: ignore[return-value]
+
+
+def _extract_layer_index(name: str) -> int | None:
+    """Extract layer index from a weight name, or None for non-layer weights.
+
+    Matches patterns like ``model.layers.5.self_attn.q_proj.weight``
+    or ``transformer.h.12.mlp.gate_proj.scales``.
+    """
+    m = _LAYER_RE.search(name)
+    return int(m.group(2)) if m else None
+
+
+class WeightBroadcastState:
+    """Holds state for layer-by-layer weight broadcasting.
+
+    Created by :func:`prepare_weight_broadcast`.  Callers stream weights
+    incrementally via :meth:`broadcast_non_layer_weights` and
+    :meth:`broadcast_layer` so that at most one layer's worth of un-sharded
+    weight data is resident at a time.
+    """
+
+    def __init__(
+        self,
+        meta: dict[str, dict[str, Any]],
+        source_weights: dict[str, mx.array] | None,
+        group: Group,
+        is_source: bool,
+    ) -> None:
+        self.meta = meta
+        self.source_weights = source_weights
+        self.group = group
+        self.is_source = is_source
+
+        # Partition weight names into layer vs. non-layer
+        self.layer_names: dict[int, list[str]] = {}
+        self.non_layer_names: list[str] = []
+        for name in sorted(meta.keys()):
+            layer_idx = _extract_layer_index(name)
+            if layer_idx is not None:
+                self.layer_names.setdefault(layer_idx, []).append(name)
+            else:
+                self.non_layer_names.append(name)
+
+        logger.info(
+            f"WeightBroadcastState: {len(self.non_layer_names)} non-layer weights, "
+            f"{len(self.layer_names)} layers"
+        )
+
+    # ------------------------------------------------------------------
+    # Internal helpers
+    # ------------------------------------------------------------------
+
+    def _broadcast_names(self, names: list[str]) -> dict[str, mx.array]:
+        """Broadcast a specific set of weight tensors by name."""
+        all_sum = partial(_all_sum_cpu, group=self.group)
+        result: dict[str, mx.array] = {}
+        for name in names:
+            info = self.meta[name]
+            shape = cast(list[int], info["s"])
+            dtype = _parse_mx_dtype(cast(str, info["d"]))
+
+            if self.is_source:
+                assert self.source_weights is not None
+                tensor = self.source_weights.pop(name)
+                mx.eval(tensor)  # loads from disk (lazy)
+            else:
+                tensor = mx.zeros(shape, dtype=dtype)
+
+            broadcasted = all_sum(tensor)
+            mx.eval(broadcasted)
+            result[name] = broadcasted
+        return result
+
+    # ------------------------------------------------------------------
+    # Public API
+    # ------------------------------------------------------------------
+
+    def broadcast_non_layer_weights(self) -> dict[str, mx.array]:
+        """Broadcast non-layer weights (embeddings, norms, lm_head)."""
+        if not self.non_layer_names:
+            return {}
+        logger.info(
+            f"Broadcasting {len(self.non_layer_names)} non-layer weight tensors"
+        )
+        return self._broadcast_names(self.non_layer_names)
+
+    def broadcast_layer(self, layer_idx: int) -> dict[str, mx.array]:
+        """Broadcast weights for a single transformer layer."""
+        names = self.layer_names.get(layer_idx, [])
+        if not names:
+            return {}
+        return self._broadcast_names(names)
+
+
+def prepare_weight_broadcast(
+    model_path: Path,
+    group: Group,
+    is_source: bool,
+) -> WeightBroadcastState:
+    """Prepare for layer-by-layer weight broadcasting.
+
+    Source loads safetensors lazily and broadcasts weight metadata (names,
+    shapes, dtypes) as JSON.  Returns a :class:`WeightBroadcastState` that
+    can then stream weights incrementally via ``broadcast_layer()``.
+
+    All ranks must call this function (collective operation).
+    """
+    source_weights: dict[str, mx.array] | None = None
+    if is_source:
+        source_weights = {}
+        weight_files = sorted(model_path.glob("*.safetensors"))
+        if not weight_files:
+            weight_files = sorted(model_path.glob("**/*.safetensors"))
+        for wf in weight_files:
+            try:
+                loaded = cast(
+                    dict[str, mx.array],
+                    mx.load(str(wf), lazy=True),  # pyright: ignore[reportCallIssue]
+                )
+            except TypeError:
+                loaded = cast(dict[str, mx.array], mx.load(str(wf)))
+            source_weights.update(loaded)
+        logger.info(
+            f"Source loaded {len(source_weights)} weight tensors (lazy) "
+            f"from {len(weight_files)} files"
+        )
+
+    # Broadcast metadata
+    if is_source and source_weights is not None:
+        source_meta: dict[str, dict[str, Any]] = {
+            name: {"s": list(tensor.shape), "d": str(tensor.dtype)}
+            for name, tensor in source_weights.items()
+        }
+    else:
+        source_meta = {}
+
+    meta = cast(
+        dict[str, dict[str, Any]],
+        _broadcast_json(source_meta if is_source else None, group, is_source),
+    )
+
+    logger.info(f"Weight broadcast prepared: {len(meta)} tensors")
+    return WeightBroadcastState(meta, source_weights, group, is_source)
+
+
+def broadcast_model_weights(
+    model_path: Path,
+    group: Group,
+    is_source: bool,
+) -> dict[str, mx.array]:
+    """
+    Broadcast model weight tensors from source rank to all receivers' memory.
+
+    Source loads weights from .safetensors files on disk and broadcasts each
+    tensor via all_sum. Receivers receive tensors directly as mx.arrays in
+    memory — no disk write for weight data.
+
+    All ranks must call this function (collective operation).
+    Only the designated source (is_source=True) should send; all others receive.
+
+    Returns:
+        dict mapping weight names to mx.arrays (on all ranks).
+    """
+    all_sum = partial(_all_sum_cpu, group=group)
+
+    # Source loads weights (lazy if supported, so only one tensor in memory at a time)
+    weights: dict[str, mx.array] = {}
+    if is_source:
+        weight_files = sorted(model_path.glob("*.safetensors"))
+        if not weight_files:
+            weight_files = sorted(model_path.glob("**/*.safetensors"))
+        for wf in weight_files:
+            try:
+                loaded = cast(dict[str, mx.array], mx.load(str(wf), lazy=True))  # pyright: ignore[reportCallIssue]
+            except TypeError:
+                loaded = cast(dict[str, mx.array], mx.load(str(wf)))
+            weights.update(loaded)
+        logger.info(
+            f"Source loaded {len(weights)} weight tensors from {len(weight_files)} files"
+        )
+
+    # Broadcast weight metadata: {name: {"s": shape, "d": dtype string}}
+    if is_source:
+        source_meta: dict[str, dict[str, Any]] = {
+            name: {"s": list(tensor.shape), "d": str(tensor.dtype)}
+            for name, tensor in weights.items()
+        }
+    else:
+        source_meta = {}
+    meta = cast(
+        dict[str, dict[str, Any]],
+        _broadcast_json(source_meta if is_source else None, group, is_source),
+    )
+
+    logger.info(f"Broadcasting {len(meta)} weight tensors")
+
+    # Broadcast each tensor in sorted order (deterministic across ranks).
+    # Source loads one tensor at a time from disk (lazy), broadcasts it,
+    # then drops the reference so only one tensor is in flight at a time.
+    result: dict[str, mx.array] = {}
+    for i, name in enumerate(sorted(meta.keys())):
+        info = meta[name]
+        shape = cast(list[int], info["s"])
+        dtype_str = cast(str, info["d"])
+        dtype = _parse_mx_dtype(dtype_str)
+
+        if is_source:
+            tensor = weights.pop(name)  # pop to free lazy ref after broadcast
+            mx.eval(tensor)  # loads from disk
+        else:
+            tensor = mx.zeros(shape, dtype=dtype)
+
+        broadcasted = all_sum(tensor)
+        mx.eval(broadcasted)
+        result[name] = broadcasted
+
+        if (i + 1) % 100 == 0:
+            logger.info(f"  Broadcast {i + 1}/{len(meta)} tensors")
+
+    logger.info(f"Weight broadcast complete: {len(result)} tensors")
+    return result
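
Reviewer note on model_transfer.py: the module docstring describes the protocol as "source has real data, receivers have zeros", so an all_sum yields the source data on every rank. The sketch below simulates that idea in plain Python, with no MLX dependency. The `all_sum`, `coordinate`, and `broadcast_bytes` helpers here are hypothetical stand-ins written for illustration, not the functions added by this diff. The payload length is assumed to be known to all ranks, whereas the real code broadcasts the length first.

```python
from typing import List, Tuple


def all_sum(contributions: List[List[int]]) -> List[int]:
    """Element-wise sum across ranks; a stand-in for a collective all_sum."""
    return [sum(column) for column in zip(*contributions)]


def coordinate(has_model: List[bool]) -> Tuple[bool, int]:
    """Return (needs_transfer, source_rank) from per-rank availability flags."""
    world_size = len(has_model)
    # Each rank contributes a vector with a 1 at its own index if it has the model.
    one_hots = [
        [1 if (i == rank and flag) else 0 for i in range(world_size)]
        for rank, flag in enumerate(has_model)
    ]
    flags = all_sum(one_hots)
    total = sum(flags)
    if total == 0:
        raise RuntimeError("no rank has the model")
    if total == world_size:
        return False, 0
    # The lowest rank that has the model becomes the source.
    return True, next(i for i, f in enumerate(flags) if f > 0)


def broadcast_bytes(payload: bytes, source_rank: int, world_size: int) -> bytes:
    """Source contributes the real bytes; every other rank contributes zeros."""
    contributions = [
        list(payload) if rank == source_rank else [0] * len(payload)
        for rank in range(world_size)
    ]
    return bytes(all_sum(contributions))


if __name__ == "__main__":
    needs_transfer, source = coordinate([False, True, True])
    received = broadcast_bytes(b'{"a":1}', source, world_size=3)
    print(needs_transfer, source, received)
```

Because exactly one rank contributes non-zero data, the sum cannot overflow a byte. This is also why the real protocol must designate a single source even when several ranks already hold the model.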
--- a/src/exo/worker/engines/mlx/utils_mlx.py
+++ b/src/exo/worker/engines/mlx/utils_mlx.py
@@ -2,6 +2,7 @@ import json
 import os
 import sys
 import time
+from collections.abc import Callable
 from pathlib import Path
 from typing import Any, cast

@@ -59,6 +60,13 @@ from exo.worker.engines.mlx.auto_parallel import (
    pipeline_auto_parallel,
    tensor_auto_parallel,
 )
+from exo.worker.engines.mlx.model_transfer import (
+    WeightBroadcastState,
+    coordinate_transfer,
+    model_path_for_id,
+    prepare_weight_broadcast,
+    transfer_metadata_files,
+)
 from exo.worker.runner.bootstrap import logger

 Group = mx.distributed.Group
@@ -197,6 +205,7 @@ def load_mlx_items(
    bound_instance: BoundInstance,
    group: Group | None,
    on_timeout: TimeoutCallback | None = None,
+    has_local_model: bool = True,
 ) -> tuple[Model, TokenizerWrapper]:
    if group is None:
        logger.info(f"Single device used for {bound_instance.instance}")
@@ -211,7 +220,10 @@ def load_mlx_items(
        logger.info("Starting distributed init")
        start_time = time.perf_counter()
        model, tokenizer = shard_and_load(
-            bound_instance.bound_shard, group=group, on_timeout=on_timeout
+            bound_instance.bound_shard,
+            group=group,
+            on_timeout=on_timeout,
+            has_local_model=has_local_model,
        )
        end_time = time.perf_counter()
        logger.info(
@@ -227,30 +239,69 @@ def shard_and_load(
    shard_metadata: ShardMetadata,
    group: Group,
    on_timeout: TimeoutCallback | None = None,
+    has_local_model: bool = True,
 ) -> tuple[nn.Module, TokenizerWrapper]:
-    model_path = build_model_path(shard_metadata.model_card.model_id)
+    model_id = shard_metadata.model_card.model_id
+    model_path = model_path_for_id(model_id)

-    model, _ = load_model(model_path, lazy=True, strict=False)
+    # Coordinate: does any rank need a transfer?
+    needs_transfer, source_rank = coordinate_transfer(group, has_local_model)
+    is_source = group.rank() == source_rank
+
+    # Step 1: Always ensure all nodes have metadata files (config, tokenizer, etc.).
+    # This is cheap (~20MB, ~1s) and guarantees config.json is present for load_model().
+    transfer_metadata_files(model_path, group, is_source)
+
+    # Step 2: Only broadcast weights if some rank is missing the model
+    broadcast_state: WeightBroadcastState | None = None
+    if needs_transfer:
+        logger.info(
+            f"Model transfer needed (source_rank={source_rank}, "
+            f"is_source={is_source}, local_weights={has_local_model})"
+        )
+        broadcast_state = prepare_weight_broadcast(model_path, group, is_source)
+
+    # Create model architecture (all ranks have config.json on disk now).
+    # Always use lazy=True when we have broadcast state: load_model's internal
+    # nn.quantize skips quantization when weights dict is empty (no safetensors),
+    # leaving the model un-quantized. lazy=False would then mx.eval() the full
+    # fp16 model (~72GB for a 36B-param model), causing OOM on the receiver.
+    # We handle quantization ourselves below before loading broadcast weights.
+    use_lazy = has_local_model or broadcast_state is not None
+    model, _ = load_model(model_path, lazy=use_lazy, strict=False)
    logger.debug(model)
    if hasattr(model, "model") and isinstance(model.model, DeepseekV3Model):  # type: ignore
        pass
        # TODO: See if we should quantize the model.
-        # def is_attention_layer(path: str) -> bool:
-        #     path = path.lower()
-
-        #     return "self_attn" in path and "layernorm" not in path
-
-        # def quant_predicate(path: str, module: nn.Module):
-        #     if not isinstance(module, nn.Linear):
-        #         return False
-
-        #     return is_attention_layer(path)
-        # model, config = quantize_model(
-        #        model, config, group_size=KV_GROUP_SIZE, bits=ATTENTION_KV_BITS, quant_predicate=quant_predicate, mode=QUANTIZE_MODEL_MODE
-        #    )

    assert isinstance(model, nn.Module)

+    if broadcast_state is not None:
+        # When the receiver has no weight files, load_model skips quantization.
+        # Apply it explicitly so QuantizedLinear layers match broadcast weight shapes.
+        if not has_local_model:
+            config_path = model_path / "config.json"
+            with open(config_path) as f:
+                config = json.load(f)  # pyright: ignore[reportAny]
+            quant_config: dict[str, int] | None = config.get(  # pyright: ignore[reportAny]
+                "quantization", None
+            )
+            if quant_config is not None:
+                logger.info(f"Applying quantization to receiver model: {quant_config}")
+                nn.quantize(  # pyright: ignore[reportUnknownMemberType]
+                    model,
+                    group_size=quant_config.get("group_size", 64),
+                    bits=quant_config.get("bits", 4),
+                )
+
+        # Broadcast and load non-layer weights (embeddings, norms, lm_head) upfront.
+        # These are small (~600MB) and needed before the sharding loop.
+        non_layer_weights = broadcast_state.broadcast_non_layer_weights()
+        if non_layer_weights:
+            model.load_weights(list(non_layer_weights.items()), strict=False)
+            logger.info(f"Loaded {len(non_layer_weights)} non-layer weight tensors")
+        del non_layer_weights
+
    tokenizer = get_tokenizer(model_path, shard_metadata)

    logger.info(f"Group size: {group.size()}, group rank: {group.rank()}")
@@ -264,12 +315,43 @@ def shard_and_load(
        f"(model size: {model_size_gb:.1f}GB)"
    )

+    # Build per-layer weight loader for streaming broadcast during sharding.
+    # Each layer's weights are broadcast via all_sum just before that layer is
+    # sharded, so at most one un-sharded layer is in memory at a time.
+    weight_loader_fn: Callable[[nn.Module, int], None] | None = None
+    if broadcast_state is not None:
+        _state = broadcast_state  # capture for closure
+
+        def _load_layer_weights(mdl: nn.Module, layer_idx: int) -> None:
+            layer_weights = _state.broadcast_layer(layer_idx)
+            if layer_weights:
+                mdl.load_weights(list(layer_weights.items()), strict=False)
+
+        weight_loader_fn = _load_layer_weights
+
    match shard_metadata:
        case TensorShardMetadata():
            logger.info(f"loading model from {model_path} with tensor parallelism")
-            model = tensor_auto_parallel(model, group, timeout_seconds, on_timeout)
+            model = tensor_auto_parallel(
+                model, group, timeout_seconds, on_timeout, weight_loader_fn
+            )
        case PipelineShardMetadata():
            logger.info(f"loading model from {model_path} with pipeline parallelism")
+            # Broadcast all layers (all_sum is collective, so all ranks must
+            # participate) but only load weights for layers this node will
+            # keep after pipeline slicing. Out-of-range results are discarded,
+            # keeping peak memory proportional to this node's layer count.
+            if broadcast_state is not None:
+                for layer_idx in sorted(broadcast_state.layer_names.keys()):
+                    layer_weights = broadcast_state.broadcast_layer(layer_idx)
+                    if (
+                        shard_metadata.start_layer
+                        <= layer_idx
+                        < shard_metadata.end_layer
+                        and layer_weights
+                    ):
+                        model.load_weights(list(layer_weights.items()), strict=False)
+                    del layer_weights
            model = pipeline_auto_parallel(model, group, shard_metadata)
            eval_with_timeout(model.parameters(), timeout_seconds, on_timeout)
        case CfgShardMetadata():
@@ -278,6 +360,8 @@ def shard_and_load(
                "this metadata type is only for image generation models"
            )

+    del broadcast_state
+
    # TODO: Do we need this?
    mx.eval(model)

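Review note: the pipeline-parallel branch in `shard_and_load` has every rank take part in every layer broadcast while keeping only its own slice. A minimal pure-Python sketch of that pattern follows. `fake_broadcast_layer` and the dict-based weight store are assumptions standing in for `broadcast_state.broadcast_layer` and `model.load_weights`; no MLX collective is actually run.

```python
def fake_broadcast_layer(layer_idx: int) -> dict[str, list[float]]:
    """Hypothetical stand-in for a collective broadcast of one layer's weights."""
    return {f"layers.{layer_idx}.weight": [float(layer_idx)] * 4}


def load_pipeline_slice(
    layer_indices: list[int], start_layer: int, end_layer: int
) -> tuple[dict[str, list[float]], int]:
    """Join every broadcast, but keep only layers in [start_layer, end_layer)."""
    kept: dict[str, list[float]] = {}
    participated = 0
    for layer_idx in sorted(layer_indices):
        # Every rank must call the collective, even for layers it will drop.
        layer_weights = fake_broadcast_layer(layer_idx)
        participated += 1
        if start_layer <= layer_idx < end_layer and layer_weights:
            kept.update(layer_weights)
        # Drop the local reference so out-of-range results can be freed.
        del layer_weights
    return kept, participated


kept, participated = load_pipeline_slice(list(range(8)), start_layer=4, end_layer=8)
print(sorted(kept))   # names of the layers this rank keeps
print(participated)   # number of broadcasts this rank joined
```

Skipping the broadcast call for out-of-range layers would desynchronize the ranks, which is why the filter is applied to the result rather than to the loop.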
--- a/src/exo/worker/main.py
+++ b/src/exo/worker/main.py
@@ -342,7 +342,7 @@ class Worker:
                    session=self.session_id,
                    event=event,
                )
-                logger.trace(f"Worker published event {idx}: {str(event)[:100]}")
+                logger.debug(f"Worker published event {idx}: {str(event)[:100]}")
                await self.local_event_sender.send(fe)
                self.out_for_delivery[event.event_id] = fe

--- a/src/exo/worker/plan.py
+++ b/src/exo/worker/plan.py
@@ -2,6 +2,7 @@

 from collections.abc import Mapping, Sequence

+from exo.shared.models.model_cards import ModelId
 from exo.shared.types.common import CommandId, NodeId
 from exo.shared.types.tasks import (
    ConnectToGroup,
@@ -16,6 +17,7 @@ from exo.shared.types.tasks import (
    TaskId,
    TaskStatus,
    TextGeneration,
+    TransferModelToDisk,
 )
 from exo.shared.types.worker.downloads import (
    DownloadCompleted,
@@ -34,8 +36,11 @@ from exo.shared.types.worker.runners import (
    RunnerLoading,
    RunnerReady,
    RunnerRunning,
+    RunnerShutdown,
+    RunnerShuttingDown,
    RunnerStatus,
    RunnerWarmingUp,
+    ShardAssignments,
 )
 from exo.worker.runner.runner_supervisor import RunnerSupervisor

@@ -57,6 +62,7 @@ def plan(
        or _create_runner(node_id, runners, instances)
        or _model_needs_download(node_id, runners, global_download_status)
        or _init_distributed_backend(runners, all_runners)
+        or _transfer_model_to_disk(runners, all_runners, global_download_status)
        or _load_model(runners, all_runners, global_download_status)
        or _ready_to_warmup(runners, all_runners)
        or _pending_tasks(runners, tasks, all_runners, input_chunk_buffer)
@@ -121,6 +127,10 @@ def _model_needs_download(
    }

    for runner in runners.values():
+        # Transfer-only instances don't need downloads
+        if runner.bound_instance.instance.shard_assignments.transfer_only:
+            continue
+
        model_id = runner.bound_instance.bound_shard.model_card.model_id
        if isinstance(runner.status, RunnerIdle) and (
            model_id not in download_status
@@ -129,6 +139,15 @@ def _model_needs_download(
                (DownloadOngoing, DownloadCompleted, DownloadFailed),
            )
        ):
+            # For multi-node instances, skip download if a peer already has the model.
+            # The model will be transferred via MLX distributed during LoadModel.
+            instance = runner.bound_instance.instance
+            is_multi_node = len(instance.shard_assignments.node_to_runner) > 1
+            if is_multi_node and _any_peer_has_model(
+                node_id, model_id, instance, global_download_status
+            ):
+                continue
+
            # We don't invalidate download_status randomly in case a file gets deleted on disk
            return DownloadModel(
                instance_id=runner.bound_instance.instance.instance_id,
@@ -186,6 +205,43 @@ def _init_distributed_backend(
    return None


+def _transfer_model_to_disk(
+    runners: Mapping[RunnerId, RunnerSupervisor],
+    all_runners: Mapping[RunnerId, RunnerStatus],
+    global_download_status: Mapping[NodeId, Sequence[DownloadProgress]],
+) -> TransferModelToDisk | None:
+    """For transfer-only instances: after all ranks are connected, emit TransferModelToDisk."""
+    for runner in runners.values():
+        instance = runner.bound_instance.instance
+        shard_assignments = instance.shard_assignments
+
+        if not shard_assignments.transfer_only:
+            continue
+
+        is_runner_connected = isinstance(runner.status, RunnerConnected)
+        all_connected_or_further = all(
+            isinstance(
+                all_runners.get(global_runner_id, None),
+                (RunnerConnected, RunnerLoading, RunnerShuttingDown, RunnerShutdown),
+            )
+            for global_runner_id in shard_assignments.runner_to_shard
+        )
+
+        if is_runner_connected and all_connected_or_further:
+            has_local = _node_has_download(
+                runner.bound_instance.bound_node_id,
+                shard_assignments.model_id,
+                global_download_status,
+            )
+            return TransferModelToDisk(
+                instance_id=instance.instance_id,
+                shard_metadata=runner.bound_instance.bound_shard,
+                has_local_model=has_local,
+            )
+
+    return None
+
+
 def _load_model(
    runners: Mapping[RunnerId, RunnerSupervisor],
    all_runners: Mapping[RunnerId, RunnerStatus],
@@ -195,38 +251,97 @@ def _load_model(
        instance = runner.bound_instance.instance
        shard_assignments = instance.shard_assignments

-        all_local_downloads_complete = all(
-            nid in global_download_status
-            and any(
-                isinstance(dp, DownloadCompleted)
-                and dp.shard_metadata.model_card.model_id == shard_assignments.model_id
-                for dp in global_download_status[nid]
-            )
-            for nid in shard_assignments.node_to_runner
-        )
-        if not all_local_downloads_complete:
+        # Transfer-only instances don't load models for inference
+        if shard_assignments.transfer_only:
            continue

-        is_single_node_instance = len(instance.shard_assignments.runner_to_shard) == 1
-        if is_single_node_instance and isinstance(runner.status, RunnerIdle):
-            return LoadModel(instance_id=instance.instance_id)
+        is_single_node_instance = len(shard_assignments.runner_to_shard) == 1

-        is_runner_waiting = isinstance(runner.status, RunnerConnected)
+        if is_single_node_instance:
+            # Single-node: require local download complete
+            if not _all_downloads_complete(shard_assignments, global_download_status):
+                continue
+            if isinstance(runner.status, RunnerIdle):
+                return LoadModel(instance_id=instance.instance_id, has_local_model=True)
+        else:
+            # Multi-node: require at least one node to have the model downloaded.
+            # Nodes without the model will receive it via MLX distributed transfer
+            # during model loading.
+            if not _any_download_complete(shard_assignments, global_download_status):
+                continue

-        all_ready_for_model = all(
-            isinstance(
-                all_runners.get(global_runner_id, None),
-                (RunnerConnected, RunnerLoading, RunnerLoaded),
+            is_runner_waiting = isinstance(runner.status, RunnerConnected)
+            all_ready_for_model = all(
+                isinstance(
+                    all_runners.get(global_runner_id, None),
+                    (RunnerConnected, RunnerLoading, RunnerLoaded),
+                )
+                for global_runner_id in shard_assignments.runner_to_shard
            )
-            for global_runner_id in shard_assignments.runner_to_shard
-        )

-        if is_runner_waiting and all_ready_for_model:
-            return LoadModel(instance_id=instance.instance_id)
+            if is_runner_waiting and all_ready_for_model:
+                has_local = _node_has_download(
+                    runner.bound_instance.bound_node_id,
+                    shard_assignments.model_id,
+                    global_download_status,
+                )
+                return LoadModel(
+                    instance_id=instance.instance_id,
+                    has_local_model=has_local,
+                )

    return None


+def _node_has_download(
+    nid: NodeId,
+    model_id: ModelId,
+    global_download_status: Mapping[NodeId, Sequence[DownloadProgress]],
+) -> bool:
+    """Check if a specific node has completed downloading the given model."""
+    return any(
+        isinstance(dp, DownloadCompleted)
+        and dp.shard_metadata.model_card.model_id == model_id
+        for dp in global_download_status.get(nid, [])
+    )
+
+
+def _any_peer_has_model(
+    node_id: NodeId,
+    model_id: ModelId,
+    instance: Instance,
+    global_download_status: Mapping[NodeId, Sequence[DownloadProgress]],
+) -> bool:
+    """Check if any other node in the instance already has the model downloaded."""
+    return any(
+        _node_has_download(nid, model_id, global_download_status)
+        for nid in instance.shard_assignments.node_to_runner
+        if nid != node_id
+    )
+
+
+def _all_downloads_complete(
+    shard_assignments: ShardAssignments,
+    global_download_status: Mapping[NodeId, Sequence[DownloadProgress]],
+) -> bool:
+    """Check if ALL nodes in the instance have completed downloading the model."""
+    return all(
+        _node_has_download(nid, shard_assignments.model_id, global_download_status)
+        for nid in shard_assignments.node_to_runner
+    )
+
+
+def _any_download_complete(
+    shard_assignments: ShardAssignments,
+    global_download_status: Mapping[NodeId, Sequence[DownloadProgress]],
+) -> bool:
+    """Check if at least one node in the instance has completed downloading the model."""
+    return any(
+        _node_has_download(nid, shard_assignments.model_id, global_download_status)
+        for nid in shard_assignments.node_to_runner
+    )
+
+
 def _ready_to_warmup(
    runners: Mapping[RunnerId, RunnerSupervisor],
    all_runners: Mapping[RunnerId, RunnerStatus],
@@ -234,6 +349,11 @@ def _ready_to_warmup(
    for runner in runners.values():
        instance = runner.bound_instance.instance
        shard_assignments = instance.shard_assignments
+
+        # Transfer-only instances don't go through warmup
+        if shard_assignments.transfer_only:
+            continue
+
        shard = runner.bound_instance.bound_shard
        device_rank = shard.device_rank
        runner_id = runner.bound_instance.bound_runner_id
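Review note: the download gating that `_load_model` now applies can be summarized with a small, self-contained sketch. The names here (`Completed`, `load_decision`, plain string node IDs) are simplified stand-ins assumed for illustration, not the actual exo types, and the runner-status checks (`RunnerIdle`, `RunnerConnected`, and so on) are deliberately left out. Only the any/all logic mirrors the diff.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class Completed:
    """Hypothetical stand-in for DownloadCompleted; carries only the model id."""

    model_id: str


def node_has_download(
    nid: str, model_id: str, status: Mapping[str, Sequence[Completed]]
) -> bool:
    # A node "has" the model if any of its completed downloads matches the id.
    return any(dp.model_id == model_id for dp in status.get(nid, []))


def load_decision(
    local: str,
    nodes: Sequence[str],
    model_id: str,
    status: Mapping[str, Sequence[Completed]],
) -> bool | None:
    """Return has_local_model if loading may start, or None to keep waiting."""
    have = [node_has_download(n, model_id, status) for n in nodes]
    if len(nodes) == 1:
        # Single node: the local download must be complete.
        return True if all(have) else None
    if not any(have):
        # Multi-node: at least one node must be able to act as the source.
        return None
    return node_has_download(local, model_id, status)


status = {"A": [Completed("m")], "B": []}
print(load_decision("A", ["A", "B"], "m", status))  # node that holds the model
print(load_decision("B", ["A", "B"], "m", status))  # node that will receive it
print(load_decision("B", ["B"], "m", status))       # single node, no download
```

The three calls correspond to the cases exercised by the updated tests in `test_download_and_loading.py`: a source node, a receiver node, and an instance that must keep waiting.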
--- a/src/exo/worker/runner/runner.py
+++ b/src/exo/worker/runner/runner.py
@@ -43,6 +43,7 @@ from exo.shared.types.tasks import (
    TaskId,
    TaskStatus,
    TextGeneration,
+    TransferModelToDisk,
 )
 from exo.shared.types.text_generation import TextGenerationTaskParams
 from exo.shared.types.worker.instances import BoundInstance
@@ -82,6 +83,11 @@ from exo.worker.engines.image import (
 from exo.worker.engines.mlx import Model
 from exo.worker.engines.mlx.cache import KVPrefixCache
 from exo.worker.engines.mlx.generator.generate import mlx_generate, warmup_inference
+from exo.worker.engines.mlx.model_transfer import (
+    coordinate_transfer,
+    model_path_for_id,
+    transfer_all_files,
+)
 from exo.worker.engines.mlx.utils_mlx import (
    apply_chat_template,
    detect_thinking_prompt_suffix,
@@ -192,7 +198,10 @@ def main(

                    if ModelTask.TextGeneration in shard_metadata.model_card.tasks:
                        model, tokenizer = load_mlx_items(
-                            bound_instance, group, on_timeout=on_model_load_timeout
+                            bound_instance,
+                            group,
+                            on_timeout=on_model_load_timeout,
+                            has_local_model=task.has_local_model,
                        )
                        logger.info(
                            f"model has_tool_calling={tokenizer.has_tool_calling}"
@@ -508,6 +517,27 @@ def main(

                    current_status = RunnerReady()
                    logger.info("runner ready")
+                case TransferModelToDisk() if (
+                    isinstance(current_status, RunnerConnected) and group is not None
+                ):
+                    logger.info("starting disk-to-disk model transfer")
+                    event_sender.send(TaskAcknowledged(task_id=task.task_id))
+
+                    model_path = model_path_for_id(
+                        task.shard_metadata.model_card.model_id
+                    )
+                    _, source_rank = coordinate_transfer(group, task.has_local_model)
+                    is_source = group.rank() == source_rank
+                    transfer_all_files(model_path, group, is_source)
+
+                    logger.info("disk-to-disk model transfer complete")
+                    current_status = RunnerShuttingDown()
+                    event_sender.send(
+                        RunnerStatusUpdated(
+                            runner_id=runner_id, runner_status=current_status
+                        )
+                    )
+                    current_status = RunnerShutdown()
                case Shutdown():
                    current_status = RunnerShuttingDown()
                    logger.info("runner shutting down")
--- a/src/exo/worker/tests/unittests/test_plan/test_download_and_loading.py
+++ b/src/exo/worker/tests/unittests/test_plan/test_download_and_loading.py
@@ -112,6 +112,7 @@ def test_plan_loads_model_when_all_shards_downloaded_and_waiting():

    assert isinstance(result, LoadModel)
    assert result.instance_id == INSTANCE_1_ID
+    assert result.has_local_model is True


 def test_plan_does_not_request_download_when_shard_already_downloaded():
@@ -157,10 +158,11 @@ def test_plan_does_not_request_download_when_shard_already_downloaded():
    assert not isinstance(result, plan_mod.DownloadModel)


-def test_plan_does_not_load_model_until_all_shards_downloaded_globally():
+def test_plan_loads_model_when_any_node_has_download_for_multi_node():
    """
-    LoadModel should not be emitted while some shards are still missing from
-    the global_download_status.
+    For multi-node instances, LoadModel should be emitted when at least one
+    node has the model downloaded. Nodes without the model will receive it
+    via MLX distributed transfer during model loading.
    """
    shard1 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=0, world_size=2)
    shard2 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=1, world_size=2)
@@ -185,6 +187,7 @@ def test_plan_does_not_load_model_until_all_shards_downloaded_globally():
        RUNNER_2_ID: RunnerConnected(),
    }

+    # Only NODE_A has the model; LoadModel should still fire
    global_download_status = {
        NODE_A: [
            DownloadCompleted(
@@ -203,19 +206,42 @@ def test_plan_does_not_load_model_until_all_shards_downloaded_globally():
        tasks={},
    )

-    assert result is None
+    assert isinstance(result, LoadModel)
+    assert result.instance_id == INSTANCE_1_ID
+    assert result.has_local_model is True

-    global_download_status = {
-        NODE_A: [
-            DownloadCompleted(
-                shard_metadata=shard1, node_id=NODE_A, total_bytes=Memory()
-            )
-        ],
-        NODE_B: [
-            DownloadCompleted(
-                shard_metadata=shard2, node_id=NODE_B, total_bytes=Memory()
-            )
-        ],  # NODE_B has no downloads completed yet
+
+def test_plan_does_not_load_model_when_no_node_has_download():
+    """
+    LoadModel should not be emitted when no node has the model downloaded.
+    """
+    shard1 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=0, world_size=2)
+    shard2 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=1, world_size=2)
+    instance = get_mlx_ring_instance(
+        instance_id=INSTANCE_1_ID,
+        model_id=MODEL_A_ID,
+        node_to_runner={NODE_A: RUNNER_1_ID, NODE_B: RUNNER_2_ID},
+        runner_to_shard={RUNNER_1_ID: shard1, RUNNER_2_ID: shard2},
+    )
+
+    bound_instance = BoundInstance(
+        instance=instance, bound_runner_id=RUNNER_1_ID, bound_node_id=NODE_A
+    )
+    local_runner = FakeRunnerSupervisor(
+        bound_instance=bound_instance, status=RunnerConnected()
+    )
+
+    runners = {RUNNER_1_ID: local_runner}
+    instances = {INSTANCE_1_ID: instance}
+    all_runners = {
+        RUNNER_1_ID: RunnerConnected(),
+        RUNNER_2_ID: RunnerConnected(),
+    }
+
+    # No node has the model
+    global_download_status: dict[NodeId, list[DownloadProgress]] = {
+        NODE_A: [],
+        NODE_B: [],
    }

    result = plan_mod.plan(
@@ -227,4 +253,57 @@ def test_plan_does_not_load_model_until_all_shards_downloaded_globally():
        tasks={},
    )

-    assert result is not None
+    assert result is None
+
+
+def test_plan_load_model_has_local_model_false_when_node_missing_download():
+    """
+    For multi-node instances, when the local node does NOT have the model
+    but a peer does, LoadModel should be emitted with has_local_model=False.
+    """
+    shard1 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=0, world_size=2)
+    shard2 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=1, world_size=2)
+    instance = get_mlx_ring_instance(
+        instance_id=INSTANCE_1_ID,
+        model_id=MODEL_A_ID,
+        node_to_runner={NODE_A: RUNNER_1_ID, NODE_B: RUNNER_2_ID},
+        runner_to_shard={RUNNER_1_ID: shard1, RUNNER_2_ID: shard2},
+    )
+
+    # NODE_B is the local node (bound_node_id=NODE_B); it does NOT have the model
+    bound_instance = BoundInstance(
+        instance=instance, bound_runner_id=RUNNER_2_ID, bound_node_id=NODE_B
+    )
+    local_runner = FakeRunnerSupervisor(
+        bound_instance=bound_instance, status=RunnerConnected()
+    )
+
+    runners = {RUNNER_2_ID: local_runner}
+    instances = {INSTANCE_1_ID: instance}
+    all_runners = {
+        RUNNER_1_ID: RunnerConnected(),
+        RUNNER_2_ID: RunnerConnected(),
+    }
+
+    # Only NODE_A has the model, NODE_B does not
+    global_download_status: dict[NodeId, list[DownloadProgress]] = {
+        NODE_A: [
+            DownloadCompleted(
+                shard_metadata=shard1, node_id=NODE_A, total_bytes=Memory()
+            )
+        ],
+        NODE_B: [],
+    }
+
+    result = plan_mod.plan(
+        node_id=NODE_B,
+        runners=runners,  # type: ignore
+        global_download_status=global_download_status,
+        instances=instances,
+        all_runners=all_runners,
+        tasks={},
+    )
+
+    assert isinstance(result, LoadModel)
+    assert result.instance_id == INSTANCE_1_ID
+    assert result.has_local_model is False
--- a/tests/auto_bench.sh
+++ b/tests/auto_bench.sh
@@ -28,12 +28,12 @@ trap 'cleanup' EXIT INT TERM

 for host; do
  ssh -T -o BatchMode=yes -o ServerAliveInterval=30 "$host@$host" \
-    "/nix/var/nix/profiles/default/bin/nix build github:exo-explore/exo/$commit" &
+    "EXO_LIBP2P_NAMESPACE=$commit /nix/var/nix/profiles/default/bin/nix build github:exo-explore/exo/$commit" &
 done
 wait
 for host; do
  ssh -T -o BatchMode=yes -o ServerAliveInterval=30 "$host@$host" \
-    "/nix/var/nix/profiles/default/bin/nix run github:exo-explore/exo/$commit -- --namespace $commit" &>/dev/null &
+    "EXO_LIBP2P_NAMESPACE=$commit /nix/var/nix/profiles/default/bin/nix run github:exo-explore/exo/$commit" &>/dev/null &
 done

 for host; do
--- a/tests/run_exo_on.sh
+++ b/tests/run_exo_on.sh
@@ -35,7 +35,7 @@ i=0
 for host; do
  colour=${colours[i++ % 4]}
  ssh -T -o BatchMode=yes -o ServerAliveInterval=30 "$host@$host" \
-    "/nix/var/nix/profiles/default/bin/nix run github:exo-explore/exo/$commit -- --namespace $commit" |&
+    "EXO_LIBP2P_NAMESPACE=$commit /nix/var/nix/profiles/default/bin/nix run github:exo-explore/exo/$commit" |&
    awk -v p="${colour}[${host}]${reset}" '{ print p $0; fflush() }' &
 done

--- a/uv.lock
+++ b/uv.lock
@@ -193,14 +193,20 @@ sdist = { url = "https://files.pythonhosted.org/packages/eb/56/b1ba7935a17738ae8
 wheels = [
    { url = "https://files.pythonhosted.org/packages/b0/1e/d22cc63332bd59b06481ceaac49d6c507598642e2230f201649058a7e704/cffi-2.0.0-cp313-cp313-manylinux1_i686.manylinux2014_i686.manylinux_2_17_i686.manylinux_2_5_i686.whl", hash = "sha256:07b271772c100085dd28b74fa0cd81c8fb1a3ba18b21e03d7c27f3436a10606b", size = 212446, upload-time = "2025-09-08T23:23:03.472Z" },
    { url = "https://files.pythonhosted.org/packages/a9/f5/a2c23eb03b61a0b8747f211eb716446c826ad66818ddc7810cc2cc19b3f2/cffi-2.0.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d48a880098c96020b02d5a1f7d9251308510ce8858940e6fa99ece33f610838b", size = 220101, upload-time = "2025-09-08T23:23:04.792Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/7f/e6647792fc5850d634695bc0e6ab4111ae88e89981d35ac269956605feba/cffi-2.0.0-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:f93fd8e5c8c0a4aa1f424d6173f14a892044054871c771f8566e4008eaa359d2", size = 207948, upload-time = "2025-09-08T23:23:06.127Z" },
+    { url = "https://files.pythonhosted.org/packages/cb/1e/a5a1bd6f1fb30f22573f76533de12a00bf274abcdc55c8edab639078abb6/cffi-2.0.0-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:dd4f05f54a52fb558f1ba9f528228066954fee3ebe629fc1660d874d040ae5a3", size = 206422, upload-time = "2025-09-08T23:23:07.753Z" },
    { url = "https://files.pythonhosted.org/packages/98/df/0a1755e750013a2081e863e7cd37e0cdd02664372c754e5560099eb7aa44/cffi-2.0.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:c8d3b5532fc71b7a77c09192b4a5a200ea992702734a2e9279a37f2478236f26", size = 219499, upload-time = "2025-09-08T23:23:09.648Z" },
    { url = "https://files.pythonhosted.org/packages/50/e1/a969e687fcf9ea58e6e2a928ad5e2dd88cc12f6f0ab477e9971f2309b57c/cffi-2.0.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:d9b29c1f0ae438d5ee9acb31cadee00a58c46cc9c0b2f9038c6b0b3470877a8c", size = 222928, upload-time = "2025-09-08T23:23:10.928Z" },
    { url = "https://files.pythonhosted.org/packages/36/54/0362578dd2c9e557a28ac77698ed67323ed5b9775ca9d3fe73fe191bb5d8/cffi-2.0.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:6d50360be4546678fc1b79ffe7a66265e28667840010348dd69a314145807a1b", size = 221302, upload-time = "2025-09-08T23:23:12.42Z" },
    { url = "https://files.pythonhosted.org/packages/d6/43/0e822876f87ea8a4ef95442c3d766a06a51fc5298823f884ef87aaad168c/cffi-2.0.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:24b6f81f1983e6df8db3adc38562c83f7d4a0c36162885ec7f7b77c7dcbec97b", size = 220049, upload-time = "2025-09-08T23:23:20.853Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/89/76799151d9c2d2d1ead63c2429da9ea9d7aac304603de0c6e8764e6e8e70/cffi-2.0.0-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:12873ca6cb9b0f0d3a0da705d6086fe911591737a59f28b7936bdfed27c0d47c", size = 207793, upload-time = "2025-09-08T23:23:22.08Z" },
+    { url = "https://files.pythonhosted.org/packages/bb/dd/3465b14bb9e24ee24cb88c9e3730f6de63111fffe513492bf8c808a3547e/cffi-2.0.0-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:d9b97165e8aed9272a6bb17c01e3cc5871a594a446ebedc996e2397a1c1ea8ef", size = 206300, upload-time = "2025-09-08T23:23:23.314Z" },
    { url = "https://files.pythonhosted.org/packages/47/d9/d83e293854571c877a92da46fdec39158f8d7e68da75bf73581225d28e90/cffi-2.0.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:afb8db5439b81cf9c9d0c80404b60c3cc9c3add93e114dcae767f1477cb53775", size = 219244, upload-time = "2025-09-08T23:23:24.541Z" },
    { url = "https://files.pythonhosted.org/packages/2b/0f/1f177e3683aead2bb00f7679a16451d302c436b5cbf2505f0ea8146ef59e/cffi-2.0.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:737fe7d37e1a1bffe70bd5754ea763a62a066dc5913ca57e957824b72a85e205", size = 222828, upload-time = "2025-09-08T23:23:26.143Z" },
    { url = "https://files.pythonhosted.org/packages/c6/0f/cafacebd4b040e3119dcb32fed8bdef8dfe94da653155f9d0b9dc660166e/cffi-2.0.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:38100abb9d1b1435bc4cc340bb4489635dc2f0da7456590877030c9b3d40b0c1", size = 220926, upload-time = "2025-09-08T23:23:27.873Z" },
    { url = "https://files.pythonhosted.org/packages/be/b4/c56878d0d1755cf9caa54ba71e5d049479c52f9e4afc230f06822162ab2f/cffi-2.0.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:7cc09976e8b56f8cebd752f7113ad07752461f48a58cbba644139015ac24954c", size = 221593, upload-time = "2025-09-08T23:23:31.91Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/0d/eb704606dfe8033e7128df5e90fee946bbcb64a04fcdaa97321309004000/cffi-2.0.0-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:92b68146a71df78564e4ef48af17551a5ddd142e5190cdf2c5624d0c3ff5b2e8", size = 209354, upload-time = "2025-09-08T23:23:33.214Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/19/3c435d727b368ca475fb8742ab97c9cb13a0de600ce86f62eab7fa3eea60/cffi-2.0.0-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:b1e74d11748e7e98e2f426ab176d4ed720a64412b6a15054378afdb71e0f37dc", size = 208480, upload-time = "2025-09-08T23:23:34.495Z" },
    { url = "https://files.pythonhosted.org/packages/d0/44/681604464ed9541673e486521497406fadcc15b5217c3e326b061696899a/cffi-2.0.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:28a3a209b96630bca57cce802da70c266eb08c6e97e5afd61a75611ee6c64592", size = 221584, upload-time = "2025-09-08T23:23:36.096Z" },
    { url = "https://files.pythonhosted.org/packages/25/8e/342a504ff018a2825d395d44d63a767dd8ebc927ebda557fecdaca3ac33a/cffi-2.0.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:7553fb2090d71822f02c629afe6042c299edf91ba1bf94951165613553984512", size = 224443, upload-time = "2025-09-08T23:23:37.328Z" },
    { url = "https://files.pythonhosted.org/packages/e1/5e/b666bacbbc60fbf415ba9988324a132c9a7a0448a9a8f125074671c0f2c3/cffi-2.0.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:6c6c373cfc5c83a975506110d17457138c8c63016b563cc9ed6e056a82f13ce4", size = 223437, upload-time = "2025-09-08T23:23:38.945Z" },
@@ -306,8 +312,10 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/5c/49/498c86566a1d80e978b42f0d702795f69887005548c041636df6ae1ca64c/cryptography-46.0.3-cp311-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:01ca9ff2885f3acc98c29f1860552e37f6d7c7d013d7334ff2a9de43a449315d", size = 4450807, upload-time = "2025-10-15T23:16:56.414Z" },
    { url = "https://files.pythonhosted.org/packages/4b/0a/863a3604112174c8624a2ac3c038662d9e59970c7f926acdcfaed8d61142/cryptography-46.0.3-cp311-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:6eae65d4c3d33da080cff9c4ab1f711b15c1d9760809dad6ea763f3812d254cb", size = 4299615, upload-time = "2025-10-15T23:16:58.442Z" },
    { url = "https://files.pythonhosted.org/packages/64/02/b73a533f6b64a69f3cd3872acb6ebc12aef924d8d103133bb3ea750dc703/cryptography-46.0.3-cp311-abi3-manylinux_2_28_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:e5bf0ed4490068a2e72ac03d786693adeb909981cc596425d09032d372bcc849", size = 4016800, upload-time = "2025-10-15T23:17:00.378Z" },
+    { url = "https://files.pythonhosted.org/packages/25/d5/16e41afbfa450cde85a3b7ec599bebefaef16b5c6ba4ec49a3532336ed72/cryptography-46.0.3-cp311-abi3-manylinux_2_28_ppc64le.whl", hash = "sha256:5ecfccd2329e37e9b7112a888e76d9feca2347f12f37918facbb893d7bb88ee8", size = 4984707, upload-time = "2025-10-15T23:17:01.98Z" },
    { url = "https://files.pythonhosted.org/packages/c9/56/e7e69b427c3878352c2fb9b450bd0e19ed552753491d39d7d0a2f5226d41/cryptography-46.0.3-cp311-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:a2c0cd47381a3229c403062f764160d57d4d175e022c1df84e168c6251a22eec", size = 4482541, upload-time = "2025-10-15T23:17:04.078Z" },
    { url = "https://files.pythonhosted.org/packages/78/f6/50736d40d97e8483172f1bb6e698895b92a223dba513b0ca6f06b2365339/cryptography-46.0.3-cp311-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:549e234ff32571b1f4076ac269fcce7a808d3bf98b76c8dd560e42dbc66d7d91", size = 4299464, upload-time = "2025-10-15T23:17:05.483Z" },
+    { url = "https://files.pythonhosted.org/packages/00/de/d8e26b1a855f19d9994a19c702fa2e93b0456beccbcfe437eda00e0701f2/cryptography-46.0.3-cp311-abi3-manylinux_2_34_ppc64le.whl", hash = "sha256:c0a7bb1a68a5d3471880e264621346c48665b3bf1c3759d682fc0864c540bd9e", size = 4950838, upload-time = "2025-10-15T23:17:07.425Z" },
    { url = "https://files.pythonhosted.org/packages/8f/29/798fc4ec461a1c9e9f735f2fc58741b0daae30688f41b2497dcbc9ed1355/cryptography-46.0.3-cp311-abi3-manylinux_2_34_x86_64.whl", hash = "sha256:10b01676fc208c3e6feeb25a8b83d81767e8059e1fe86e1dc62d10a3018fa926", size = 4481596, upload-time = "2025-10-15T23:17:09.343Z" },
    { url = "https://files.pythonhosted.org/packages/15/8d/03cd48b20a573adfff7652b76271078e3045b9f49387920e7f1f631d125e/cryptography-46.0.3-cp311-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:0abf1ffd6e57c67e92af68330d05760b7b7efb243aab8377e583284dbab72c71", size = 4426782, upload-time = "2025-10-15T23:17:11.22Z" },
    { url = "https://files.pythonhosted.org/packages/fa/b1/ebacbfe53317d55cf33165bda24c86523497a6881f339f9aae5c2e13e57b/cryptography-46.0.3-cp311-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:a04bee9ab6a4da801eb9b51f1b708a1b5b5c9eb48c03f74198464c66f0d344ac", size = 4698381, upload-time = "2025-10-15T23:17:12.829Z" },
@@ -315,8 +323,10 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/c5/fd/bc1daf8230eaa075184cbbf5f8cd00ba9db4fd32d63fb83da4671b72ed8a/cryptography-46.0.3-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:39b6755623145ad5eff1dab323f4eae2a32a77a7abef2c5089a04a3d04366715", size = 4435078, upload-time = "2025-10-15T23:17:23.042Z" },
    { url = "https://files.pythonhosted.org/packages/82/98/d3bd5407ce4c60017f8ff9e63ffee4200ab3e23fe05b765cab805a7db008/cryptography-46.0.3-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:db391fa7c66df6762ee3f00c95a89e6d428f4d60e7abc8328f4fe155b5ac6e54", size = 4293460, upload-time = "2025-10-15T23:17:24.885Z" },
    { url = "https://files.pythonhosted.org/packages/26/e9/e23e7900983c2b8af7a08098db406cf989d7f09caea7897e347598d4cd5b/cryptography-46.0.3-cp314-cp314t-manylinux_2_28_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:78a97cf6a8839a48c49271cdcbd5cf37ca2c1d6b7fdd86cc864f302b5e9bf459", size = 3995237, upload-time = "2025-10-15T23:17:26.449Z" },
+    { url = "https://files.pythonhosted.org/packages/91/15/af68c509d4a138cfe299d0d7ddb14afba15233223ebd933b4bbdbc7155d3/cryptography-46.0.3-cp314-cp314t-manylinux_2_28_ppc64le.whl", hash = "sha256:dfb781ff7eaa91a6f7fd41776ec37c5853c795d3b358d4896fdbb5df168af422", size = 4967344, upload-time = "2025-10-15T23:17:28.06Z" },
    { url = "https://files.pythonhosted.org/packages/ca/e3/8643d077c53868b681af077edf6b3cb58288b5423610f21c62aadcbe99f4/cryptography-46.0.3-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:6f61efb26e76c45c4a227835ddeae96d83624fb0d29eb5df5b96e14ed1a0afb7", size = 4466564, upload-time = "2025-10-15T23:17:29.665Z" },
    { url = "https://files.pythonhosted.org/packages/0e/43/c1e8726fa59c236ff477ff2b5dc071e54b21e5a1e51aa2cee1676f1c986f/cryptography-46.0.3-cp314-cp314t-manylinux_2_34_aarch64.whl", hash = "sha256:23b1a8f26e43f47ceb6d6a43115f33a5a37d57df4ea0ca295b780ae8546e8044", size = 4292415, upload-time = "2025-10-15T23:17:31.686Z" },
+    { url = "https://files.pythonhosted.org/packages/42/f9/2f8fefdb1aee8a8e3256a0568cffc4e6d517b256a2fe97a029b3f1b9fe7e/cryptography-46.0.3-cp314-cp314t-manylinux_2_34_ppc64le.whl", hash = "sha256:b419ae593c86b87014b9be7396b385491ad7f320bde96826d0dd174459e54665", size = 4931457, upload-time = "2025-10-15T23:17:33.478Z" },
    { url = "https://files.pythonhosted.org/packages/79/30/9b54127a9a778ccd6d27c3da7563e9f2d341826075ceab89ae3b41bf5be2/cryptography-46.0.3-cp314-cp314t-manylinux_2_34_x86_64.whl", hash = "sha256:50fc3343ac490c6b08c0cf0d704e881d0d660be923fd3076db3e932007e726e3", size = 4466074, upload-time = "2025-10-15T23:17:35.158Z" },
    { url = "https://files.pythonhosted.org/packages/ac/68/b4f4a10928e26c941b1b6a179143af9f4d27d88fe84a6a3c53592d2e76bf/cryptography-46.0.3-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:22d7e97932f511d6b0b04f2bfd818d73dcd5928db509460aaf48384778eb6d20", size = 4420569, upload-time = "2025-10-15T23:17:37.188Z" },
    { url = "https://files.pythonhosted.org/packages/a3/49/3746dab4c0d1979888f125226357d3262a6dd40e114ac29e3d2abdf1ec55/cryptography-46.0.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:d55f3dffadd674514ad19451161118fd010988540cee43d8bc20675e775925de", size = 4681941, upload-time = "2025-10-15T23:17:39.236Z" },
@@ -324,8 +334,10 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/26/42/fa8389d4478368743e24e61eea78846a0006caffaf72ea24a15159215a14/cryptography-46.0.3-cp38-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:15ab9b093e8f09daab0f2159bb7e47532596075139dd74365da52ecc9cb46c5d", size = 4440029, upload-time = "2025-10-15T23:17:49.837Z" },
    { url = "https://files.pythonhosted.org/packages/5f/eb/f483db0ec5ac040824f269e93dd2bd8a21ecd1027e77ad7bdf6914f2fd80/cryptography-46.0.3-cp38-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:46acf53b40ea38f9c6c229599a4a13f0d46a6c3fa9ef19fc1a124d62e338dfa0", size = 4297222, upload-time = "2025-10-15T23:17:51.357Z" },
    { url = "https://files.pythonhosted.org/packages/fd/cf/da9502c4e1912cb1da3807ea3618a6829bee8207456fbbeebc361ec38ba3/cryptography-46.0.3-cp38-abi3-manylinux_2_28_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:10ca84c4668d066a9878890047f03546f3ae0a6b8b39b697457b7757aaf18dbc", size = 4012280, upload-time = "2025-10-15T23:17:52.964Z" },
+    { url = "https://files.pythonhosted.org/packages/6b/8f/9adb86b93330e0df8b3dcf03eae67c33ba89958fc2e03862ef1ac2b42465/cryptography-46.0.3-cp38-abi3-manylinux_2_28_ppc64le.whl", hash = "sha256:36e627112085bb3b81b19fed209c05ce2a52ee8b15d161b7c643a7d5a88491f3", size = 4978958, upload-time = "2025-10-15T23:17:54.965Z" },
    { url = "https://files.pythonhosted.org/packages/d1/a0/5fa77988289c34bdb9f913f5606ecc9ada1adb5ae870bd0d1054a7021cc4/cryptography-46.0.3-cp38-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:1000713389b75c449a6e979ffc7dcc8ac90b437048766cef052d4d30b8220971", size = 4473714, upload-time = "2025-10-15T23:17:56.754Z" },
    { url = "https://files.pythonhosted.org/packages/14/e5/fc82d72a58d41c393697aa18c9abe5ae1214ff6f2a5c18ac470f92777895/cryptography-46.0.3-cp38-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:b02cf04496f6576afffef5ddd04a0cb7d49cf6be16a9059d793a30b035f6b6ac", size = 4296970, upload-time = "2025-10-15T23:17:58.588Z" },
+    { url = "https://files.pythonhosted.org/packages/78/06/5663ed35438d0b09056973994f1aec467492b33bd31da36e468b01ec1097/cryptography-46.0.3-cp38-abi3-manylinux_2_34_ppc64le.whl", hash = "sha256:71e842ec9bc7abf543b47cf86b9a743baa95f4677d22baa4c7d5c69e49e9bc04", size = 4940236, upload-time = "2025-10-15T23:18:00.897Z" },
    { url = "https://files.pythonhosted.org/packages/fc/59/873633f3f2dcd8a053b8dd1d38f783043b5fce589c0f6988bf55ef57e43e/cryptography-46.0.3-cp38-abi3-manylinux_2_34_x86_64.whl", hash = "sha256:402b58fc32614f00980b66d6e56a5b4118e6cb362ae8f3fda141ba4689bd4506", size = 4472642, upload-time = "2025-10-15T23:18:02.749Z" },
    { url = "https://files.pythonhosted.org/packages/3d/39/8e71f3930e40f6877737d6f69248cf74d4e34b886a3967d32f919cc50d3b/cryptography-46.0.3-cp38-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:ef639cb3372f69ec44915fafcd6698b6cc78fbe0c2ea41be867f6ed612811963", size = 4423126, upload-time = "2025-10-15T23:18:04.85Z" },
    { url = "https://files.pythonhosted.org/packages/cd/c7/f65027c2810e14c3e7268353b1681932b87e5a48e65505d8cc17c99e36ae/cryptography-46.0.3-cp38-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:3b51b8ca4f1c6453d8829e1eb7299499ca7f313900dd4d89a24b8b87c0a780d4", size = 4686573, upload-time = "2025-10-15T23:18:06.908Z" },
Author	SHA1	Message	Date
Alex Cheema	3425c0ef51	fix: always transfer metadata files to all nodes before weight loading Move transfer_metadata_files() outside the conditional needs_transfer block so it always runs in multi-node setups. This ensures config.json, tokenizer files, and other metadata are present on all nodes before load_model() is called, regardless of download status. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 15:01:00 -08:00
Alex Cheema	5734408157	feat: replace has_weight_files with state-based has_local_model + fix pipeline peak memory - Add has_local_model field to LoadModel and TransferModelToDisk tasks, computed from download status in plan.py instead of filesystem checks - Remove has_weight_files() function entirely - In pipeline broadcast path, only load weights for layers in this node's [start_layer, end_layer) range — discard out-of-range results to reduce peak memory from ~22GB to ~10GB for 2-node pipeline Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 14:26:23 -08:00
Alex Cheema	a34970bb5c	feat: stream weight broadcast layer-by-layer to reduce peak memory Instead of accumulating all weight tensors (~18GB for GLM) in a dict before loading into the model, broadcast weights incrementally during the sharding loop. Non-layer weights (embeddings, norms, lm_head) are loaded upfront (~600MB), then each layer's weights are broadcast and loaded just before that layer is sharded. Peak memory drops from ~22GB to ~10GB, matching lazy-from-disk behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 13:38:32 -08:00
Alex Cheema	02a78afb87	fix: free broadcast weights before sharding to halve peak memory After model.load_weights(), both the broadcast_weights dict and the model's parameter tree hold references to the same arrays. During tensor_auto_parallel, the old full-size arrays can't be freed because the dict still references them, causing ~2x peak memory. Delete the dict before sharding so arrays are freed as each layer is replaced with its sharded version. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 13:14:12 -08:00
Alex Cheema	e1975558c1	refactor: invert metadata detection to exclude .safetensors Instead of maintaining an allowlist of metadata extensions (which broke when chat_template.jinja was missing), treat everything that isn't a .safetensors file as metadata. More robust against new file formats. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 13:01:22 -08:00
Alex Cheema	9d6d60c411	fix: include .jinja files in metadata transfer chat_template.jinja is needed by transformers for chat formatting (e.g., GLM-4.7-Flash stores its chat template there). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 12:55:33 -08:00
Alex Cheema	e13e7af6e8	fix: prevent OOM on receiver by using lazy=True with broadcast weights When the receiver has no safetensors files, load_model's internal nn.quantize skips quantization (class_predicate finds no .scales keys in empty weights dict), leaving the model un-quantized as full fp16. With lazy=False, mx.eval(model.parameters()) materializes ~72GB of fp16 data for a 36B-param model on a 24GB machine → silent OOM kill. Fix: use lazy=True when broadcast_weights is available. This skips the eager eval, and our code handles quantization correctly before loading the broadcast weights. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 12:50:14 -08:00
Alex Cheema	fb2b0148ee	fix: fall back to non-lazy mx.load when lazy param unavailable MLX < 0.31 doesn't support mx.load(lazy=True). Try lazy first, fall back to eager loading on TypeError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 12:32:39 -08:00
Alex Cheema	7c0147f544	refactor: clean up transfer code structure - Collapse _broadcast_int/_broadcast_bytes/_broadcast_json chain into single _broadcast_json (each was only called once) - Extract _node_has_download helper to deduplicate download-checking logic across _any_peer_has_model, _all_downloads_complete, and _any_download_complete - Remove unused has_metadata_files function - Fix module docstring ("two" → "three" transfer modes) - Remove section divider comment banners - Simplify redundant is_source check in temp_dir conditional Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 12:18:45 -08:00
Alex Cheema	f0d7560ec0	refactor: move model_transfer imports to top of file No circular dependency — model_transfer.py doesn't import from utils_mlx.py or runner.py. Also remove redundant `import json` that shadowed the module-level import. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 12:08:56 -08:00
Alex Cheema	a1e4d5aba1	fix: use single source rank for all_sum and lazy-load weights Previously, all ranks with the model sent real data in all_sum, which corrupts results with >2 nodes (data+data+0 = 2*data). Now only the designated source_rank sends; all others send zeros regardless of whether they have local files. Also switch to mx.load(lazy=True) + weights.pop() so the source only has one tensor in memory at a time instead of loading all safetensors upfront. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:59:51 -08:00
Alex Cheema	2c500ab8cf	chore: remove temporary debug weight comparison logging The transfer is working end-to-end. Remove the _debug_compare_weights function that was used to diagnose the shape mismatch issue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:48:16 -08:00
Alex Cheema	f9d7af4dbf	fix: apply nn.quantize on receiver before loading broadcast weights Root cause: load_model(lazy=False, strict=False) without weight files skips quantization, creating float32 Linear layers (310 params) instead of uint32 QuantizedLinear layers (704 params). The broadcast weights from the source are quantized, so load_weights silently fails to replace them due to shape mismatches (e.g., broadcast (2048,128)/uint32 vs model (2048,1024)/float32). Fix: read quantization config from config.json and call nn.quantize() before loading broadcast weights, ensuring QuantizedLinear layers match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:42:58 -08:00
Alex Cheema	f70be12b04	debug: add weight comparison logging for transfer diagnosis Temporary debug logging to compare broadcast weight names/shapes against model parameters after load_model to diagnose rms_norm crash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:36:59 -08:00
Alex Cheema	8c49adc97a	fix: use lazy=False and exclude safetensors index for weight broadcast receivers When receiving weights via MLX distributed broadcast, the receiver node has no .safetensors files on disk. Two issues caused rms_norm shape mismatch during warmup: 1. model.safetensors.index.json was transferred as metadata (has .json ext), causing load_model to create lazy tensor refs to nonexistent files 2. lazy=True created dangling references even without the index file Fix: exclude *.safetensors.index.json from metadata transfer, and use lazy=False when receiver has no local weight files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:24:31 -08:00
Alex Cheema	e86e6a9d1e	feat: MLX distributed model transfer across nodes Add two transfer features using MLX distributed all_sum: 1. Disk-to-memory (automatic): During model loading in multi-node instances, if a peer has the model and the local node doesn't, stream weight tensors directly into memory via all_sum. No disk write on the receiver. 2. Disk-to-disk (explicit API): POST /v1/models/{model_id}/distribute copies all model files from source to target nodes' disk via MLX distributed file transfer. New module: model_transfer.py with coordinate_transfer(), transfer_metadata_files(), broadcast_model_weights(), transfer_all_files() Modified plan.py to skip downloads when peers have the model and accept partial downloads for multi-node LoadModel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:12:13 -08:00
Jake Hillion	cc33213842	bench: add --settle-timeout for cluster startup retry (#1449) exo_bench.py fails if started too soon after a cluster starts because the topology hasn't populated yet, resulting in no valid placements. Extracted the preview-fetch-and-filter logic into a `fetch_and_filter_placements` helper and added a retry loop with exponential backoff (1s initial, 2x multiplier, 60s cap). The new `--settle-timeout` flag controls how long to retry (default 0 = try once, preserving existing behaviour). Each retry logs a warning explaining the cluster may still be settling. Test plan: - Tested on several freshly started clusters. This used to fail a lot, now it succeeds.	2026-02-12 16:38:09 +00:00
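
The retry behaviour described in the last commit above (1s initial delay, 2x multiplier, 60s cap, and a settle timeout of 0 meaning "try once") can be sketched roughly as follows. This is a minimal illustration, not the code from exo_bench.py: `retry_with_backoff` and its parameters are hypothetical names, and `fetch` stands in for a helper such as `fetch_and_filter_placements`, assumed here to return an empty list while the cluster is still settling.

```python
import logging
import time
from typing import Callable, List

logger = logging.getLogger(__name__)


def retry_with_backoff(
    fetch: Callable[[], List[str]],      # hypothetical stand-in for the placement fetch
    settle_timeout: float,               # seconds to keep retrying; 0 = try once
    initial_delay: float = 1.0,
    multiplier: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Call fetch() until it returns a non-empty result or the timeout expires."""
    deadline = clock() + settle_timeout
    delay = initial_delay
    while True:
        result = fetch()
        if result:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            # Out of time: hand back the last (empty) result to the caller.
            return result
        logger.warning("No valid placements yet; cluster may still be settling.")
        # Never sleep past the deadline, then grow the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * multiplier, max_delay)
```

The `sleep` and `clock` parameters are injected only so the loop can be exercised without real waiting; a caller would normally leave them at their defaults.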