pipeline parallel fix

2025-12-23 22:27:50 -05:00 · 2025-11-07 18:19:19 -08:00
parent 612f58c78d
commit 9058b117c0
10 changed files with 286 additions and 115 deletions
--- a/TODO.md
+++ b/TODO.md
@@ -17,9 +17,37 @@
 19. Fix mx.distributed.Group typing.
 20. Add chat completion cancellations (e.g OpenWebUI has something for cancelling an ongoing request).
 21. Make two separate things: tensor or pipeline, and ring or ibv.
+22. When downloading a model for the first time, things time out, and I think the model never actually ends up loading into memory.
+23. Do we need cache_limit? We went back and forth on this a lot because we thought it might be causing issues. One problem is that it is set relative to model size, so if multiple models are loaded, it uses the most recently loaded model's size for the cache_limit. This is problematic if you launch DeepSeek -> Llama, for example.

 Potential refactors:

 1. Make ForwarderEvent typed
 2. Topology can be simplified
 3. Get rid of InstanceReplacedAtomically
+
+Random errors we've run into:
+
+1. exo.shared.types.worker.common.RunnerError: RuntimeError: [ibv] Couldn't connect (error: 60). Traceback: Traceback (most recent call last):
+  File "/Users/puffin4/actions-runner/_work/exo/exo/src/exo/worker/runner/runner.py", line 54, in main
+    model, tokenizer, sampler, group = await loop.run_in_executor(
+                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    ...<8 lines>...
+    )
+    ^
+  File "/nix/store/s7ik6dazn4nd2jdg9l36qf5q0z18sjyk-python3-3.13.8/lib/python3.13/concurrent/futures/thread.py", line 59, in run
+    result = self.fn(*self.args, **self.kwargs)
+  File "/Users/puffin4/actions-runner/_work/exo/exo/src/exo/engines/mlx/utils_mlx.py", line 149, in initialize_mlx
+    group = mlx_distributed_init(
+        model_shard_meta.device_rank,
+    ...<4 lines>...
+        or (mlx_ibv_devices is not None and len(mlx_ibv_devices) > 1),
+    )
+  File "/Users/puffin4/actions-runner/_work/exo/exo/src/exo/engines/mlx/utils_mlx.py", line 124, in mlx_distributed_init
+    group = mx.distributed.init(
+        backend="ring" if hosts is not None else "ibv",
+        strict=strict,
+    )
+RuntimeError: [ibv] Couldn't connect (error: 60)
+
+2.