Compare commits


17 Commits

Author SHA1 Message Date
Alex Cheema
7469f44e58 fix: clean up stale runners from state when instance is deleted
apply_instance_deleted() previously only removed the instance from
state.instances, leaving its runner entries orphaned in state.runners
with their last known status (e.g. RunnerReady). After a node kill and
rejoin, readiness checks would see these stale entries and attempt
inference against dead runner processes, causing post-recovery failures.
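
A minimal sketch of the cleanup this commit describes, using hypothetical simplified types (State, RunnerEntry) rather than exo's actual data model:

```python
from dataclasses import dataclass, field


# Hypothetical, simplified shapes; the real exo state types differ.
@dataclass
class RunnerEntry:
    instance_id: str
    status: str  # e.g. "RunnerReady"


@dataclass
class State:
    instances: dict[str, object] = field(default_factory=dict)
    runners: dict[str, RunnerEntry] = field(default_factory=dict)


def apply_instance_deleted(state: State, instance_id: str) -> None:
    """Remove the instance and any runner entries that belonged to it."""
    state.instances.pop(instance_id, None)
    # Without this step, runners keep their last known status (e.g. RunnerReady)
    # and post-recovery readiness checks target dead runner processes.
    state.runners = {
        rid: r for rid, r in state.runners.items() if r.instance_id != instance_id
    }
```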

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 18:41:22 -08:00
Alex Cheema
5d26d2dcd6 fix: eliminate serialization bottlenecks in continuous batching pipeline
The batch prefill deferral (35973b86) was insufficient on its own because
multiple other serialization points prevented true concurrent request
processing. This fixes five bottlenecks:

- Drain all available tasks per plan_step cycle instead of one-per-100ms
- Keep tasks in supervisor pending set after ACK to prevent re-dispatch
- Process TextGeneration tasks inline during decode loop (no break/restart)
- Pre-tokenize in queue_request so sync_and_insert_pending is lightweight
- Reuse prompt from queue_request for thinking detection (no double apply_chat_template)
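
A hedged illustration of the drain-all pattern from the first bullet; the queue and function name are stand-ins, not exo's actual scheduler API:

```python
import queue


def drain_pending_tasks(task_queue: "queue.Queue[object]") -> list[object]:
    # Take every task that is already waiting in one plan_step cycle,
    # instead of pulling a single task per 100ms tick.
    drained: list[object] = []
    while True:
        try:
            drained.append(task_queue.get_nowait())
        except queue.Empty:
            return drained
```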

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 18:36:56 -08:00
Alex Cheema
35973b8698 fix: defer batch prefill for true continuous batching
Move sync_and_insert_pending() out of the per-task loop so all
concurrently-arrived requests share a single batched prefill pass.
Previously each request was prefilled individually as it arrived,
serializing what should be a parallel operation.

Also break early from the generation TimeBudget loop when new tasks
are waiting, so they get inserted sooner rather than blocking for
the full 0.5s budget.
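
Roughly, the deferred-prefill restructuring looks like the sketch below; `queue_request` and `sync_and_insert_pending` stand in for the runner's real methods and the engine is deliberately left untyped:

```python
def accept_new_tasks(new_tasks: list[object], engine) -> None:
    # Before: each arriving request triggered its own prefill (serialized).
    # After: queue every arrival first, then run one batched prefill pass.
    for task in new_tasks:
        engine.queue_request(task)
    if new_tasks:
        engine.sync_and_insert_pending()  # single prefill for all queued requests
```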

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 17:52:04 -08:00
Alex Cheema
41d9d2a61f fix: add has_thinking to mock tokenizers in edge case tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 17:02:07 -08:00
Alex Cheema
efbf9850eb style: apply nix fmt to new edge case test file
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:55:38 -08:00
Alex Cheema
4a22c4b512 fix: restore per-request sampling, model-specific parsers, error handling, and tracing
- Use per-request temperature/top_p/top_k from TextGenerationTaskParams
  instead of hardcoded sampler defaults in BatchGenerationEngine
- Restore model-specific tokenizer patches (Kimi, GLM) at load time
- Add GptOssTracker for per-request GPT-OSS stream parsing with
  thinking channels and tool call routing
- Filter Kimi section boundary tokens from batch output
- Detect thinking prompt suffix and prepend think_start on first token
- Wrap batch_engine.step() in try/except, send ErrorChunks on failure
- Call _send_traces_if_enabled() when text generation tasks complete
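
For the first bullet, per-request sampler construction is roughly as below, assuming your mlx_lm version's `make_sampler` accepts `temp`, `top_p`, and `top_k` keywords and that the params object mirrors the relevant TextGenerationTaskParams fields:

```python
from dataclasses import dataclass

from mlx_lm.sample_utils import make_sampler


@dataclass
class SamplingParams:  # stand-in for the relevant TextGenerationTaskParams fields
    temperature: float = 0.7
    top_p: float = 1.0
    top_k: int = -1  # -1 conventionally disables top-k


def sampler_for(params: SamplingParams):
    # Build the sampler from the request's own parameters instead of
    # hardcoded engine-wide defaults.
    return make_sampler(temp=params.temperature, top_p=params.top_p, top_k=params.top_k)
```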

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:55:18 -08:00
Alex Cheema
4d0fe5d17b test: add edge-case tests for continuous batching
Cover concurrent tool calls, length/stop finish reasons, multiple
completions per step, staggered draining, and batches of 5-10 requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:27:45 -08:00
Alex Cheema
9fe7251796 fix: add generation loop, deferred task completion, and tool call tracking
The batch engine integration had three critical issues:
1. No generation loop - batch_engine.step() only ran during shutdown drain
2. Tasks marked complete before any tokens were generated
3. Tool calls dropped - parse_tool_calls pipeline was disconnected

Restructure runner main() into a two-phase while loop that alternates
between TimeBudget-based generation steps and task polling. Add
ToolCallTracker for per-request tool call state in the batch path,
and defer TaskStatusUpdated(Complete) until finish_reason is set.
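
The two-phase structure described above looks roughly like this sketch; TimeBudget is simplified here and the engine/polling calls are assumptions, not the runner's actual API:

```python
import time


class TimeBudget:  # simplified stand-in
    def __init__(self, seconds: float) -> None:
        self._deadline = time.monotonic() + seconds

    def remaining(self) -> bool:
        return time.monotonic() < self._deadline


def runner_main(engine, poll_tasks, should_shutdown) -> None:
    while not should_shutdown():
        # Phase 1: generate within a small time budget for the current batch.
        budget = TimeBudget(0.5)
        while budget.remaining() and engine.has_active_requests():
            engine.step()  # completion is deferred until finish_reason is set
        # Phase 2: poll for newly arrived tasks and insert them into the batch.
        for task in poll_tasks():
            engine.queue_request(task)
        engine.sync_and_insert_pending()
```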

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 15:22:09 -08:00
Alex Cheema
1c8f69ce00 fix: address PR review comments for continuous batching
- Use get_args(FinishReason) instead of hardcoded finish reason checks
- Use new Python type syntax (def share_object[T]) instead of TypeVar
- Assert obj is not None for rank 0 with message to use mx_barrier()
- Raise RuntimeError on size=0 instead of silently returning None
- Simplify share_object callers since return type is now T (not T | None)
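
Illustration of the review items around FinishReason and the new generic syntax (Python 3.12+); the FinishReason literal here is a guess at the shape of the real type:

```python
from typing import Literal, get_args

FinishReason = Literal["stop", "length", "tool_calls"]  # hypothetical shape


def is_finish_reason(value: str) -> bool:
    # Check against get_args(FinishReason) rather than hardcoding the strings.
    return value in get_args(FinishReason)


def share_object[T](obj: T | None) -> T:  # PEP 695 syntax instead of a TypeVar
    # Rank 0 must supply an object; for a bare synchronization point use mx_barrier().
    assert obj is not None, "share_object requires an object on rank 0; use mx_barrier() instead"
    return obj
```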

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 14:29:13 -08:00
Alex Cheema
f19166617a fix: use EventCollector instead of mp_channel to fix flaky test
Replace mp_channel event receiver with direct EventCollector in
test_continuous_batching.py to eliminate multiprocessing pipe race
condition that caused test_runner_status_reflects_active_requests
to intermittently miss RunnerRunning events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 14:00:28 -08:00
Alex Cheema
51e959c979 feat: integrate BatchGenerationEngine into runner for continuous batching
Replace synchronous per-request text generation with BatchGenerationEngine,
enabling continuous batching of multiple concurrent inference requests.

- Runner accepts TextGeneration in both RunnerReady and RunnerRunning states
- Requests are queued and sync-inserted into the batch engine
- Batch engine is drained during shutdown to complete in-flight requests
- Only rank 0 emits ChunkGenerated events (distributed-safe)
- Enable previously-skipped continuous batching tests
- Update event ordering tests for the new batch-based flow
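
A hedged sketch of the shutdown drain plus rank-0-only emission described above; the engine and emit callable are placeholders:

```python
def drain_on_shutdown(engine, rank: int, emit) -> None:
    # Keep stepping until every in-flight request reaches a finish_reason,
    # so shutdown does not drop partially generated responses.
    while engine.has_active_requests():
        for response in engine.step():
            if rank == 0:  # only rank 0 emits ChunkGenerated (distributed-safe)
                emit(response)
```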

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 13:47:04 -08:00
Alex Cheema
cd43588a04 Merge remote-tracking branch 'origin/main' into alexcheema/continuous-batching
2026-02-13 10:09:40 -08:00
Alex Cheema
7b879593bb fix: update continuous batching types after main merge
Replace ChatCompletionTaskParams with TextGenerationTaskParams and
ChatCompletion with TextGeneration to match the refactored type
hierarchy from main. Add missing usage parameter to GenerationResponse
constructors and add type annotations to StreamingDetokenizer stubs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 07:29:13 -08:00
Alex Cheema
e4e895d7a8 Merge remote-tracking branch 'origin/main' into alexcheema/continuous-batching
# Conflicts:
#	AGENTS.md
2026-02-13 05:57:15 -08:00
Alex Cheema
db400dbb75 skip continuous batching tests pending type migration
The continuous batching runner architecture references old types
(ChatCompletion, ChatCompletionTaskParams) that were renamed on main.
Skip the test module until the batch engine code is updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 06:18:40 -08:00
Alex Cheema
15fad9c632 Merge remote-tracking branch 'origin/main' into alexcheema/continuous-batching
# Conflicts:
#	.mlx_typings/mlx_lm/tokenizer_utils.pyi
#	src/exo/worker/runner/runner.py
#	src/exo/worker/runner/runner_supervisor.py
#	src/exo/worker/tests/unittests/test_runner/test_event_ordering.py
2026-02-05 06:12:49 -08:00
Alex Cheema
842beefac0 feat: add continuous batching for distributed inference
Implements continuous batching using mlx_lm's BatchGenerator for efficient
multi-request handling in distributed mode.

Key changes:
- Add BatchGenerationEngine that wraps mlx_lm's BatchGenerator for continuous
  batching with prefill batching (up to 8 requests) and decode batching
- Add TimeBudget pattern for controlling generation loop timing with periodic
  distributed sync
- Add distributed_sync utilities for broadcasting objects across ranks using
  mx.distributed.all_sum()
- Stream tokens immediately as generated for smooth streaming (not in batches)
- Fix distributed correctness: deferred shutdown handling, sync_completions
  always syncs in distributed mode to prevent deadlocks
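
The distributed_sync idea from the bullet above (broadcasting objects with only an all-reduce primitive) can be sketched as below; this is an assumption-laden illustration (pickle payload, int32 buffers, a `group` from `mx.distributed.init()`), not exo's actual utility:

```python
import pickle

import numpy as np
import mlx.core as mx


def share_object(obj, group):
    """Broadcast obj from rank 0 to all ranks using only mx.distributed.all_sum."""
    rank = group.rank()
    payload = pickle.dumps(obj) if rank == 0 else b""
    # Agree on the length: non-root ranks contribute 0, so the sum is rank 0's length.
    n = int(mx.distributed.all_sum(mx.array([len(payload)]), group=group).item())
    # Sum byte buffers: again only rank 0 contributes non-zero values.
    buf = np.zeros(n, dtype=np.int32)
    if rank == 0:
        buf[:] = np.frombuffer(payload, dtype=np.uint8)
    shared = mx.distributed.all_sum(mx.array(buf), group=group)
    return pickle.loads(np.array(shared).astype(np.uint8).tobytes())
```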

Performance results on Kimi K2 Thinking (658GB) with Tensor RDMA:
- Batch 1:  10.7 tok/s (baseline)
- Batch 4:  34.6 tok/s (3.2x speedup)
- Batch 16: 41.8 tok/s (3.9x speedup)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:27:15 +00:00
63 changed files with 4343 additions and 2032 deletions

View File

@@ -8,6 +8,33 @@ on:
- main
jobs:
typecheck:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
lfs: false
- uses: cachix/install-nix-action@v31
with:
nix_path: nixpkgs=channel:nixos-unstable
- uses: cachix/cachix-action@v14
name: Configure Cachix
with:
name: exo
authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
- name: Load nix develop environment
run: nix run github:nicknovitski/nix-develop/v1
- name: Sync dependencies
run: uv sync --all-packages
- name: Run type checker
run: uv run basedpyright --project pyproject.toml
nix:
name: Build and check (${{ matrix.system }})
runs-on: ${{ matrix.runner }}

View File

@@ -276,24 +276,23 @@ class BatchGenerator:
logprobs: mx.array
finish_reason: Optional[str]
unprocessed_prompts: List[Any]
def __init__(
self,
model,
model: nn.Module,
max_tokens: int = ...,
stop_tokens: Optional[set] = ...,
stop_tokens: Optional[set[int]] = ...,
sampler: Optional[Callable[[mx.array], mx.array]] = ...,
completion_batch_size: int = ...,
prefill_batch_size: int = ...,
prefill_step_size: int = ...,
) -> None: ...
def insert(
self, prompts, max_tokens: Union[List[int], int, None] = ...
): # -> list[Any]:
...
def stats(self): # -> BatchStats:
...
def next(self): # -> list[Any]:
...
self, prompts: List[List[int]], max_tokens: Union[List[int], int, None] = ...
) -> List[int]: ...
def stats(self) -> BatchStats: ...
def next(self) -> List[Response]: ...
def batch_generate(
model,

View File

@@ -39,11 +39,11 @@ class StreamingDetokenizer:
"""
__slots__ = ...
def reset(self): ...
def add_token(self, token): ...
def finalize(self): ...
def reset(self) -> None: ...
def add_token(self, token: int) -> None: ...
def finalize(self) -> None: ...
@property
def last_segment(self):
def last_segment(self) -> str:
"""Return the last segment of readable text since last time this property was accessed."""
class NaiveStreamingDetokenizer(StreamingDetokenizer):

View File

@@ -116,10 +116,49 @@ From .cursorrules:
- Catch exceptions only where you can handle them meaningfully
- Use `@final` and immutability wherever applicable
## Model Storage
Downloaded models are stored in `~/.exo/models/` (not the standard HuggingFace cache location).
## Creating Model Instances via API
When testing with the API, you must first create a model instance before sending chat completions:
```bash
# 1. Get instance previews for a model
curl "http://localhost:52415/instance/previews?model_id=llama-3.2-1b"
# 2. Create an instance from the first valid preview
INSTANCE=$(curl -s "http://localhost:52415/instance/previews?model_id=llama-3.2-1b" | jq -c '.previews[] | select(.error == null) | .instance' | head -n1)
curl -X POST http://localhost:52415/instance -H 'Content-Type: application/json' -d "{\"instance\": $INSTANCE}"
# 3. Wait for the runner to become ready (check logs for "runner ready")
# 4. Send chat completions using the full model ID
curl -X POST http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "mlx-community/Llama-3.2-1B-Instruct-4bit", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
```
## Logs
Exo logs are stored in `~/.exo/exo.log`. This is useful for debugging runner crashes and distributed issues.
## Testing
Tests use pytest-asyncio with `asyncio_mode = "auto"`. Tests are in `tests/` subdirectories alongside the code they test. The `EXO_TESTS=1` env var is set during tests.
### Distributed Testing
When running distributed tests across multiple machines, use `EXO_LIBP2P_NAMESPACE` to isolate your test cluster from other exo instances on the same network:
```bash
# On each machine in the test cluster, use the same unique namespace
EXO_LIBP2P_NAMESPACE=my-test-cluster uv run exo
```
This prevents your test cluster from discovering and interfering with production or other developers' exo clusters.
## Dashboard UI Testing & Screenshots
### Building and Running the Dashboard

Cargo.lock generated
View File

@@ -141,6 +141,12 @@ version = "0.3.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "76a2e8124351fda1ef8aaaa3bbd7ebbcb486bbcd4225aca0aa0d84bb2db8fecb"
[[package]]
name = "arrayvec"
version = "0.7.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50"
[[package]]
name = "asn1-rs"
version = "0.7.1"
@@ -298,6 +304,19 @@ version = "1.8.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "55248b47b0caf0546f7988906588779981c43bb1bc9d0c44087278f80cdb44ba"
[[package]]
name = "bigdecimal"
version = "0.4.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "560f42649de9fa436b73517378a147ec21f6c997a546581df4b4b31677828934"
dependencies = [
"autocfg",
"libm",
"num-bigint",
"num-integer",
"num-traits",
]
[[package]]
name = "bimap"
version = "0.6.3"
@@ -497,6 +516,15 @@ version = "0.4.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2f421161cb492475f1661ddc9815a745a1c894592070661180fdec3d4872e9c3"
[[package]]
name = "convert_case"
version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "633458d4ef8c78b72454de2d54fd6ab2e60f9e02be22f3c6104cdc8a4e0fceb9"
dependencies = [
"unicode-segmentation",
]
[[package]]
name = "core-foundation"
version = "0.9.4"
@@ -718,6 +746,29 @@ dependencies = [
"powerfmt",
]
[[package]]
name = "derive_more"
version = "2.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "10b768e943bed7bf2cab53df09f4bc34bfd217cdb57d971e769874c9a6710618"
dependencies = [
"derive_more-impl",
]
[[package]]
name = "derive_more-impl"
version = "2.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6d286bfdaf75e988b4a78e013ecd79c581e06399ab53fbacd2d916c2f904f30b"
dependencies = [
"convert_case",
"proc-macro2",
"quote",
"rustc_version",
"syn 2.0.111",
"unicode-xid",
]
[[package]]
name = "digest"
version = "0.10.7"
@@ -888,17 +939,22 @@ name = "exo_pyo3_bindings"
version = "0.0.1"
dependencies = [
"delegate",
"derive_more",
"env_logger",
"extend",
"futures",
"impl-trait-for-tuples",
"libp2p",
"log",
"networking",
"once_cell",
"pin-project",
"pyo3",
"pyo3-async-runtimes",
"pyo3-log",
"pyo3-stub-gen",
"thiserror 2.0.17",
"thread_local",
"tokio",
"util",
]
@@ -1584,6 +1640,17 @@ dependencies = [
"xmltree",
]
[[package]]
name = "impl-trait-for-tuples"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a0eb5a3343abf848c0984fe4604b2b105da9539376e24fc0a3b0007411ae4fd9"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.111",
]
[[package]]
name = "indexmap"
version = "2.12.1"
@@ -1762,6 +1829,12 @@ version = "0.2.178"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "37c93d8daa9d8a012fd8ab92f088405fb202ea0b6ab73ee2482ae66af4f42091"
[[package]]
name = "libm"
version = "0.2.15"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f9fbbcab51052fe104eb5e5d351cf728d30a5be1fe14d9be8a3b097481fb97de"
[[package]]
name = "libp2p"
version = "0.56.0"
@@ -2751,13 +2824,16 @@ name = "networking"
version = "0.0.1"
dependencies = [
"delegate",
"derive_more",
"either",
"extend",
"futures",
"futures-timer",
"impl-trait-for-tuples",
"keccak-const",
"libp2p",
"log",
"thiserror 2.0.17",
"tokio",
"tracing-subscriber",
"util",
@@ -2842,6 +2918,17 @@ dependencies = [
"num-traits",
]
[[package]]
name = "num-rational"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f83d14da390562dca69fc84082e73e548e1ad308d24accdedd2720017cb37824"
dependencies = [
"num-bigint",
"num-integer",
"num-traits",
]
[[package]]
name = "num-traits"
version = "0.2.19"
@@ -3192,14 +3279,28 @@ version = "0.27.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ab53c047fcd1a1d2a8820fe84f05d6be69e9526be40cb03b73f86b6b03e6d87d"
dependencies = [
"bigdecimal",
"either",
"hashbrown 0.16.1",
"indexmap",
"indoc",
"inventory",
"libc",
"lock_api",
"memoffset",
"num-bigint",
"num-complex",
"num-rational",
"num-traits",
"once_cell",
"ordered-float",
"parking_lot",
"portable-atomic",
"pyo3-build-config",
"pyo3-ffi",
"pyo3-macros",
"rust_decimal",
"smallvec",
"unindent",
]
@@ -3640,6 +3741,16 @@ dependencies = [
"tokio",
]
[[package]]
name = "rust_decimal"
version = "1.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "35affe401787a9bd846712274d97654355d21b2a2c092a3139aabe31e9022282"
dependencies = [
"arrayvec",
"num-traits",
]
[[package]]
name = "rustc-hash"
version = "1.1.0"
@@ -4504,12 +4615,24 @@ version = "1.0.22"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9312f7c4f6ff9069b165498234ce8be658059c6728633667c526e27dc2cf1df5"
[[package]]
name = "unicode-segmentation"
version = "1.12.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f6ccf251212114b54433ec949fd6a7841275f9ada20dddd2f29e9ceea4501493"
[[package]]
name = "unicode-width"
version = "0.2.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254"
[[package]]
name = "unicode-xid"
version = "0.2.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853"
[[package]]
name = "unicode_names2"
version = "1.3.0"

View File

@@ -26,21 +26,49 @@ opt-level = 3
networking = { path = "rust/networking" }
util = { path = "rust/util" }
# Proc-macro authoring tools
syn = "2.0"
quote = "1.0"
proc-macro2 = "1.0"
darling = "0.20"
# Macro dependencies
extend = "1.2"
delegate = "0.13"
impl-trait-for-tuples = "0.2"
clap = "4.5"
derive_more = { version = "2.0.1", features = ["display"] }
pin-project = "1"
# Utility dependencies
itertools = "0.14"
thiserror = "2"
internment = "0.8"
recursion = "0.5"
regex = "1.11"
once_cell = "1.21"
thread_local = "1.1"
bon = "3.4"
generativity = "1.1"
anyhow = "1.0"
keccak-const = "0.2"
# Functional generics/lenses frameworks
frunk_core = "0.4"
frunk = "0.4"
frunk_utils = "0.2"
frunk-enum-core = "0.3"
# Async dependencies
tokio = "1.46"
futures = "0.3"
futures-util = "0.3"
futures-timer = "3.0"
# Data structures
either = "1.15"
ordered-float = "5.0"
ahash = "0.8"
# Tracing/logging
log = "0.4"

View File

@@ -5,21 +5,21 @@
[X] Fetching download status of all models on start
[X] Deduplication of tasks in plan_step.
[X] resolve_allow_patterns should just be wildcard now.
[X] no mx_barrier in generate.py mlx_generate at the end.
[] no mx_barrier in generate.py mlx_generate at the end.
[] cache assertion not needed in auto_parallel.py PipelineLastLayer.
[X] GPTOSS support dropped in auto_parallel.py.
[X] sharding changed "all-to-sharded" became _all_to_sharded in auto_parallel.py.
[X] same as above with "sharded-to-all" became _sharded_to_all in auto_parallel.py.
[X] Dropped support for Ministral3Model, DeepseekV32Model, Glm4MoeModel, Qwen3NextModel, GptOssMode in auto_parallel.py.
[] GPTOSS support dropped in auto_parallel.py.
[] sharding changed "all-to-sharded" became _all_to_sharded in auto_parallel.py.
[] same as above with "sharded-to-all" became _sharded_to_all in auto_parallel.py.
[] Dropped support for Ministral3Model, DeepseekV32Model, Glm4MoeModel, Qwen3NextModel, GptOssMode in auto_parallel.py.
[] Dropped prefill/decode code in auto_parallel.py and utils_mlx.py.
[X] KV_CACHE_BITS should be None to disable quantized KV cache.
[X] Dropped _set_nofile_limit in utils_mlx.py.
[X] We have group optional in load_mlx_items in utils_mlx.py.
[X] Dropped add_missing_chat_templates for GptOss in load_mlx_items in utils_mlx.py.
[X] Dropped model.make_cache in make_kv_cache in utils_mlx.py.
[] Dropped _set_nofile_limit in utils_mlx.py.
[] We have group optional in load_mlx_items in utils_mlx.py.
[] Dropped add_missing_chat_templates for GptOss in load_mlx_items in utils_mlx.py.
[] Dropped model.make_cache in make_kv_cache in utils_mlx.py.
[X] We put cache limit back in utils_mlx.py.
[X] topology.py remove_node removes the connections after checking if node is in self._node_id_to_rx_id_map. On beta_1 it checks after, so would remove stale connections I guess?
[X] Missing Glm 4.7 model cards (this isn't ready yet but should be picked up, probably create an issue... the blocker is that the transformers version doesn't support the tokenizer for Glm 4.7. rc-1 does, but we can't upgrade as it breaks other things.)
[] topology.py remove_node removes the connections after checking if node is in self._node_id_to_rx_id_map. On beta_1 it checks after, so would remove stale connections I guess?
[] Missing Glm 4.7 model cards (this isn't ready yet but should be picked up, probably create an issue... the blocker is that the transformers version doesn't support the tokenizer for Glm 4.7. rc-1 does, but we can't upgrade as it breaks other things.)
[] try-except in _command_processor only catches ValueError. This was silently failing, leading to un-debuggable errors (we had a KeyError that was happening). Changed this to catch Exception instead of ValueError. See exo-v2 89ae38405e0052e3c22405daf094b065878aa873 and fb99fea69b5a39017efc90c5dad0072e677455f0.
[X] In placement.py, place_instance no longer looks at model_meta.supports_tensor or checks whether this tensor-parallel number of nodes is supported by the model's tensor dimensions.
[X] In placement.py, place_instance, we no longer have the special case to exclude DeepSeek v3.1 pipeline parallel (it doesn't work).

View File

@@ -126,37 +126,11 @@ final class ExoProcessController: ObservableObject {
return
}
process.terminationHandler = nil
status = .stopped
guard process.isRunning else {
self.process = nil
return
if process.isRunning {
process.terminate()
}
let proc = process
self.process = nil
Task.detached {
proc.interrupt()
for _ in 0..<50 {
if !proc.isRunning { return }
try? await Task.sleep(nanoseconds: 100_000_000)
}
if proc.isRunning {
proc.terminate()
}
for _ in 0..<30 {
if !proc.isRunning { return }
try? await Task.sleep(nanoseconds: 100_000_000)
}
if proc.isRunning {
kill(proc.processIdentifier, SIGKILL)
}
}
status = .stopped
}
func restart() {

View File

@@ -1,7 +0,0 @@
# Canary benchmark manifest
#
# Lists the suite files to include. Each file defines benchmarks
# with shared constraints, topology, and default args.
include = [
"single-m3-ultra.toml",
]

View File

@@ -288,151 +288,6 @@ def resolve_model_short_id(client: ExoClient, model_arg: str) -> tuple[str, str]
raise ValueError(f"Model not found in /models: {model_arg}")
def run_planning_phase(
client: ExoClient,
full_model_id: str,
preview: dict[str, Any],
danger_delete: bool,
timeout: float,
settle_deadline: float | None,
) -> None:
"""Check disk space and ensure model is downloaded before benchmarking."""
# Get model size from /models
models = client.request_json("GET", "/models") or {}
model_bytes = 0
for m in models.get("data", []):
if m.get("hugging_face_id") == full_model_id:
model_bytes = m.get("storage_size_megabytes", 0) * 1024 * 1024
break
if not model_bytes:
logger.warning(
f"Could not determine size for {full_model_id}, skipping disk check"
)
return
# Get nodes from preview
inner = unwrap_instance(preview["instance"])
node_ids = list(inner["shardAssignments"]["nodeToRunner"].keys())
runner_to_shard = inner["shardAssignments"]["runnerToShard"]
state = client.request_json("GET", "/state")
downloads = state.get("downloads", {})
node_disk = state.get("nodeDisk", {})
for node_id in node_ids:
node_downloads = downloads.get(node_id, [])
# Check if model already downloaded on this node
already_downloaded = any(
"DownloadCompleted" in p
and unwrap_instance(p["DownloadCompleted"]["shardMetadata"])["modelCard"][
"modelId"
]
== full_model_id
for p in node_downloads
)
if already_downloaded:
continue
# Wait for disk info if settle_deadline is set
disk_info = node_disk.get(node_id, {})
backoff = _SETTLE_INITIAL_BACKOFF_S
while not disk_info and settle_deadline and time.monotonic() < settle_deadline:
remaining = settle_deadline - time.monotonic()
logger.info(
f"Waiting for disk info on {node_id} ({remaining:.0f}s remaining)..."
)
time.sleep(min(backoff, remaining))
backoff = min(backoff * _SETTLE_BACKOFF_MULTIPLIER, _SETTLE_MAX_BACKOFF_S)
state = client.request_json("GET", "/state")
node_disk = state.get("nodeDisk", {})
disk_info = node_disk.get(node_id, {})
if not disk_info:
logger.warning(f"No disk info for {node_id}, skipping space check")
continue
avail = disk_info.get("available", {}).get("inBytes", 0)
if avail >= model_bytes:
continue
if not danger_delete:
raise RuntimeError(
f"Insufficient disk on {node_id}: need {model_bytes // (1024**3)}GB, "
f"have {avail // (1024**3)}GB. Use --danger-delete-downloads to free space."
)
# Delete from smallest to largest
completed = [
(
unwrap_instance(p["DownloadCompleted"]["shardMetadata"])["modelCard"][
"modelId"
],
p["DownloadCompleted"]["totalBytes"]["inBytes"],
)
for p in node_downloads
if "DownloadCompleted" in p
]
for del_model, size in sorted(completed, key=lambda x: x[1]):
logger.info(f"Deleting {del_model} from {node_id} ({size // (1024**2)}MB)")
client.request_json("DELETE", f"/download/{node_id}/{del_model}")
avail += size
if avail >= model_bytes:
break
if avail < model_bytes:
raise RuntimeError(f"Could not free enough space on {node_id}")
# Start downloads (idempotent)
for node_id in node_ids:
runner_id = inner["shardAssignments"]["nodeToRunner"][node_id]
shard = runner_to_shard[runner_id]
client.request_json(
"POST",
"/download/start",
body={
"targetNodeId": node_id,
"shardMetadata": shard,
},
)
logger.info(f"Started download on {node_id}")
# Wait for downloads
start = time.time()
while time.time() - start < timeout:
state = client.request_json("GET", "/state")
downloads = state.get("downloads", {})
all_done = True
for node_id in node_ids:
done = any(
"DownloadCompleted" in p
and unwrap_instance(p["DownloadCompleted"]["shardMetadata"])[
"modelCard"
]["modelId"]
== full_model_id
for p in downloads.get(node_id, [])
)
failed = [
p["DownloadFailed"]["errorMessage"]
for p in downloads.get(node_id, [])
if "DownloadFailed" in p
and unwrap_instance(p["DownloadFailed"]["shardMetadata"])["modelCard"][
"modelId"
]
== full_model_id
]
if failed:
raise RuntimeError(f"Download failed on {node_id}: {failed[0]}")
if not done:
all_done = False
if all_done:
return
time.sleep(1)
raise TimeoutError("Downloads did not complete in time")
def placement_filter(instance_meta: str, wanted: str) -> bool:
s = (instance_meta or "").lower()
if wanted == "both":
@@ -680,11 +535,6 @@ def main() -> int:
default=0,
help="Max seconds to wait for the cluster to produce valid placements (0 = try once).",
)
ap.add_argument(
"--danger-delete-downloads",
action="store_true",
help="Delete existing models from smallest to largest to make room for benchmark model.",
)
args = ap.parse_args()
pp_list = parse_int_list(args.pp)
@@ -719,16 +569,13 @@ def main() -> int:
logger.error("[exo-bench] tokenizer usable but prompt sizing failed")
raise
settle_deadline = (
time.monotonic() + args.settle_timeout if args.settle_timeout > 0 else None
)
selected = fetch_and_filter_placements(client, full_model_id, args)
if not selected and settle_deadline:
if not selected and args.settle_timeout > 0:
backoff = _SETTLE_INITIAL_BACKOFF_S
while not selected and time.monotonic() < settle_deadline:
remaining = settle_deadline - time.monotonic()
deadline = time.monotonic() + args.settle_timeout
while not selected and time.monotonic() < deadline:
remaining = deadline - time.monotonic()
logger.warning(
f"No valid placements yet (cluster may still be settling). "
f"Retrying in {backoff:.1f}s ({remaining:.0f}s remaining)..."
@@ -760,16 +607,6 @@ def main() -> int:
if args.dry_run:
return 0
logger.info("Planning phase: checking downloads...")
run_planning_phase(
client,
full_model_id,
selected[0],
args.danger_delete_downloads,
args.timeout,
settle_deadline,
)
all_rows: list[dict[str, Any]] = []
for preview in selected:

View File

@@ -1,189 +0,0 @@
# Single-node M3 Ultra benchmarks
#
# Shared constraints applied to ALL benchmarks in this file.
constraints = [
"All(MacOsBuild(=25D125))",
"Hosts(=1)",
"All(Chip(m3_ultra))",
"All(GpuCores(=80))",
]
[topology]
type = "none"
# Default args merged into each benchmark's args (benchmark-level args win).
[defaults]
pp = [512, 2048, 8192, 16384]
tg = 128
[[benchmark]]
model = "mlx-community/Meta-Llama-3.1-70B-Instruct-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/gpt-oss-120b-MXFP4-Q8"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.7-Flash-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Coder-Next-6bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-30B-A3B-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-0.6B-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-0.6B-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Llama-3.2-1B-Instruct-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Llama-3.2-3B-Instruct-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Llama-3.2-3B-Instruct-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Meta-Llama-3.1-8B-Instruct-bf16"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/gpt-oss-20b-MXFP4-Q8"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-30B-A3B-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.7-Flash-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.7-Flash-5bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.7-Flash-6bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Llama-3.3-70B-Instruct-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Coder-Next-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Coder-Next-5bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Coder-Next-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Next-80B-A3B-Thinking-8bit"
extra_constraints = ["All(Memory(>=96GiB))"]
[[benchmark]]
model = "mlx-community/Llama-3.3-70B-Instruct-8bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/llama-3.3-70b-instruct-fp16"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.5-Air-8bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.5-Air-bf16"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.7-4bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/MiniMax-M2.1-3bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/MiniMax-M2.1-8bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-235B-A22B-Instruct-2507-4bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Coder-Next-bf16"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/Step-3.5-Flash-4bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/Step-3.5-Flash-6bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/Step-3.5-Flash-8Bit"
extra_constraints = ["All(Memory(>=256GiB))"]
[[benchmark]]
model = "mlx-community/DeepSeek-V3.1-4bit"
extra_constraints = ["All(Memory(>=512GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.7-6bit"
extra_constraints = ["All(Memory(>=512GiB))"]
[[benchmark]]
model = "mlx-community/GLM-4.7-8bit-gs32"
extra_constraints = ["All(Memory(>=512GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-235B-A22B-Instruct-2507-8bit"
extra_constraints = ["All(Memory(>=512GiB))"]
[[benchmark]]
model = "mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit"
extra_constraints = ["All(Memory(>=512GiB))"]

conftest.py Normal file
View File

@@ -0,0 +1 @@
collect_ignore = ["tests/start_distributed_test.py"]

View File

@@ -265,7 +265,6 @@
function handleSubmit() {
if ((!message.trim() && uploadedFiles.length === 0) || loading) return;
if (isEditOnlyWithoutImage) return;
const content = message.trim();
const files = [...uploadedFiles];
@@ -290,11 +289,7 @@
if (imageFile.preview) {
editImage(content, imageFile.preview);
}
} else if (
currentModel &&
modelSupportsTextToImage(currentModel) &&
content
) {
} else if (isImageModel() && content) {
// Use image generation for text-to-image models
generateImage(content);
} else {

View File

@@ -225,7 +225,6 @@
}
function handleDeleteClick(messageId: string) {
if (loading) return;
deleteConfirmId = messageId;
}
@@ -256,7 +255,7 @@
</script>
<div class="flex flex-col gap-4 sm:gap-6 {className}">
{#each messageList as message, i (message.id)}
{#each messageList as message (message.id)}
<div
class="group flex {message.role === 'user'
? 'justify-end'
@@ -318,11 +317,9 @@
<!-- Delete confirmation -->
<div class="bg-red-500/10 border border-red-500/30 rounded-lg p-3">
<p class="text-xs text-red-400 mb-3">
{#if i === messageList.length - 1}
Delete this message?
{:else}
Delete this message and all messages after it?
{/if}
Delete this message{message.role === "user"
? " and all responses after it"
: ""}?
</p>
<div class="flex gap-2 justify-end">
<button
@@ -754,13 +751,8 @@
<!-- Delete button -->
<button
onclick={() => handleDeleteClick(message.id)}
disabled={loading}
class="p-1.5 transition-colors rounded {loading
? 'text-exo-light-gray/30 cursor-not-allowed'
: 'text-exo-light-gray hover:text-red-400 hover:bg-red-500/10 cursor-pointer'}"
title={loading
? "Cannot delete while generating"
: "Delete message"}
class="p-1.5 text-exo-light-gray hover:text-red-400 transition-colors rounded hover:bg-red-500/10 cursor-pointer"
title="Delete message"
>
<svg
class="w-3.5 h-3.5"

View File

File diff suppressed because it is too large.

View File

@@ -132,7 +132,7 @@ markers = [
env = [
"EXO_TESTS=1"
]
addopts = "-m 'not slow' --ignore=tests/start_distributed_test.py"
addopts = "-m 'not slow'"
filterwarnings = [
"ignore:builtin type Swig:DeprecationWarning",
]

View File

@@ -14,9 +14,7 @@
# Override overlay to inject Nix-built components
exoOverlay = final: prev: {
# Replace workspace exo_pyo3_bindings with Nix-built wheel.
# Preserve passthru so mkVirtualEnv can resolve dependency groups.
# Copy .pyi stub + py.typed marker so basedpyright can find the types.
# Replace workspace exo_pyo3_bindings with Nix-built wheel
exo-pyo3-bindings = pkgs.stdenv.mkDerivation {
pname = "exo-pyo3-bindings";
version = "0.1.0";
@@ -24,12 +22,6 @@
# Install from pre-built wheel
nativeBuildInputs = [ final.pyprojectWheelHook ];
dontStrip = true;
passthru = prev.exo-pyo3-bindings.passthru or { };
postInstall = ''
local siteDir=$out/${final.python.sitePackages}/exo_pyo3_bindings
cp ${inputs.self}/rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi $siteDir/
touch $siteDir/py.typed
'';
};
};
@@ -37,32 +29,17 @@
# Overlay to provide build systems and custom packages
buildSystemsOverlay = final: prev: {
# Use our pure Nix-built MLX with Metal support
mlx = self'.packages.mlx;
# mlx-lm is a git dependency that needs setuptools
mlx-lm = prev.mlx-lm.overrideAttrs (old: {
nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [
final.setuptools
];
});
} // lib.optionalAttrs pkgs.stdenv.hostPlatform.isDarwin {
# Use our pure Nix-built MLX with Metal support (macOS only)
mlx = self'.packages.mlx;
};
# Additional overlay for Linux-specific fixes (type checking env).
# Native wheels have shared lib dependencies we don't need at type-check time.
linuxOverlay = final: prev:
let
ignoreMissing = drv: drv.overrideAttrs { autoPatchelfIgnoreMissingDeps = [ "*" ]; };
nvidiaPackages = lib.filterAttrs (name: _: lib.hasPrefix "nvidia-" name) prev;
in
lib.optionalAttrs pkgs.stdenv.hostPlatform.isLinux (
(lib.mapAttrs (_: ignoreMissing) nvidiaPackages) // {
mlx = ignoreMissing prev.mlx;
torch = ignoreMissing prev.torch;
triton = ignoreMissing prev.triton;
}
);
pythonSet = (pkgs.callPackage inputs.pyproject-nix.build.packages {
inherit python;
}).overrideScope (
@@ -71,7 +48,6 @@
overlay
exoOverlay
buildSystemsOverlay
linuxOverlay
]
);
exoVenv = pythonSet.mkVirtualEnv "exo-env" workspace.deps.default;
@@ -142,21 +118,6 @@
${pkgs.ruff}/bin/ruff check ${inputs.self}
touch $out
'';
# Hermetic basedpyright type checking
typecheck = pkgs.runCommand "typecheck"
{
nativeBuildInputs = [
testVenv
pkgs.basedpyright
];
}
''
cd ${inputs.self}
export HOME=$TMPDIR
basedpyright --pythonpath ${testVenv}/bin/python
touch $out
'';
};
};
}

View File

@@ -25,17 +25,17 @@ workspace = true
networking = { workspace = true }
# interop
pyo3 = { version = "0.27.2", features = [
# "abi3-py313", # tells pyo3 (and maturin) to build using the stable ABI with minimum Python version 3.13
pyo3 = { version = "0.27.1", features = [
# "abi3-py311", # tells pyo3 (and maturin) to build using the stable ABI with minimum Python version 3.11
"nightly", # enables better-supported GIL integration
"experimental-async", # async support in #[pyfunction] & #[pymethods]
#"experimental-inspect", # inspection of generated binary => easier to automate type-hint generation
#"py-clone", # adding Clone-ing of `Py<T>` without GIL (may cause panics - remove if panics happen)
# "multiple-pymethods", # allows multiple #[pymethods] sections per class
"multiple-pymethods", # allows multiple #[pymethods] sections per class
# integrations with other libraries
# "arc_lock", "bigdecimal", "either", "hashbrown", "indexmap", "num-bigint", "num-complex", "num-rational",
# "ordered-float", "rust_decimal", "smallvec",
"arc_lock", "bigdecimal", "either", "hashbrown", "indexmap", "num-bigint", "num-complex", "num-rational",
"ordered-float", "rust_decimal", "smallvec",
# "anyhow", "chrono", "chrono-local", "chrono-tz", "eyre", "jiff-02", "lock_api", "parking-lot", "time", "serde",
] }
pyo3-stub-gen = { version = "0.17.2" }
@@ -45,6 +45,8 @@ pyo3-log = "0.13.2"
# macro dependencies
extend = { workspace = true }
delegate = { workspace = true }
impl-trait-for-tuples = { workspace = true }
derive_more = { workspace = true }
pin-project = { workspace = true }
# async runtime
@@ -52,11 +54,24 @@ tokio = { workspace = true, features = ["full", "tracing"] }
futures = { workspace = true }
# utility dependencies
once_cell = "1.21.3"
thread_local = "1.1.9"
util = { workspace = true }
thiserror = { workspace = true }
#internment = { workspace = true }
#recursion = { workspace = true }
#generativity = { workspace = true }
#itertools = { workspace = true }
# Tracing
#tracing = "0.1"
#tracing-subscriber = "0.3"
#console-subscriber = "0.1.5"
#tracing-log = "0.2.0"
log = { workspace = true }
env_logger = "0.11"
# Networking
libp2p = { workspace = true, features = ["full"] }

View File

@@ -6,7 +6,7 @@ use pyo3::marker::Ungil;
use pyo3::prelude::*;
use std::{
future::Future,
pin::Pin,
pin::{Pin, pin},
task::{Context, Poll},
};
@@ -33,6 +33,8 @@ where
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
let waker = cx.waker();
Python::attach(|py| py.detach(|| self.project().0.poll(&mut Context::from_waker(waker))))
Python::with_gil(|py| {
py.allow_threads(|| self.project().0.poll(&mut Context::from_waker(waker)))
})
}
}

View File

@@ -0,0 +1,240 @@
//! This module exists to hold examples of some pyo3 patterns that may be too complex to
//! re-create from scratch, but too inhomogeneous to create an abstraction/wrapper around.
//!
//! Pattern examples include:
//! - Async task handles: with GC-integrated cleanup
//! - Sync/async callbacks from Python: with proper event-loop handling
//!
//! Mutability pattern: https://pyo3.rs/v0.26.0/async-await.html#send--static-constraint
//! - Store mutable fields in tokio's `Mutex<T>`
//! - For async code: take `&self` and `.lock().await`
//! - For sync code: take `&mut self` and `.get_mut()`
use crate::ext::{PyResultExt as _, ResultExt as _, TokioRuntimeExt as _};
use futures::FutureExt as _;
use futures::future::BoxFuture;
use pyo3::exceptions::PyRuntimeError;
use pyo3::prelude::{PyModule, PyModuleMethods as _};
use pyo3::{
Bound, Py, PyAny, PyErr, PyResult, PyTraverseError, PyVisit, Python, pyclass, pymethods,
};
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TryRecvError;
fn needs_tokio_runtime() {
tokio::runtime::Handle::current();
}
type SyncCallback = Box<dyn Fn() + Send + Sync>;
type AsyncCallback = Box<dyn Fn() -> BoxFuture<'static, ()> + Send + Sync>;
enum AsyncTaskMessage {
SyncCallback(SyncCallback),
AsyncCallback(AsyncCallback),
}
async fn async_task(
sender: mpsc::UnboundedSender<()>,
mut receiver: mpsc::UnboundedReceiver<AsyncTaskMessage>,
) {
log::info!("RUST: async task started");
// task state
let mut interval = tokio::time::interval(Duration::from_secs(1));
let mut sync_cbs: Vec<SyncCallback> = vec![];
let mut async_cbs: Vec<AsyncCallback> = vec![];
loop {
tokio::select! {
// handle incoming messages from task-handle
message = receiver.recv() => {
// handle closed channel by exiting
let Some(message) = message else {
log::info!("RUST: channel closed");
break;
};
// dispatch incoming event
match message {
AsyncTaskMessage::SyncCallback(cb) => {
sync_cbs.push(cb);
}
AsyncTaskMessage::AsyncCallback(cb) => {
async_cbs.push(cb);
}
}
}
// handle all other events
_ = interval.tick() => {
log::info!("RUST: async task tick");
// call back all sync callbacks
for cb in &sync_cbs {
cb();
}
// call back all async callbacks
for cb in &async_cbs {
cb().await;
}
// send event on unbounded channel
sender.send(()).expect("handle receiver cannot be closed/dropped");
}
}
}
log::info!("RUST: async task stopped");
}
// #[gen_stub_pyclass]
#[pyclass(name = "AsyncTaskHandle")]
#[derive(Debug)]
struct PyAsyncTaskHandle {
sender: Option<mpsc::UnboundedSender<AsyncTaskMessage>>,
receiver: mpsc::UnboundedReceiver<()>,
}
#[allow(clippy::expect_used)]
impl PyAsyncTaskHandle {
const fn sender(&self) -> &mpsc::UnboundedSender<AsyncTaskMessage> {
self.sender
.as_ref()
.expect("The sender should only be None after de-initialization.")
}
const fn sender_mut(&mut self) -> &mpsc::UnboundedSender<AsyncTaskMessage> {
self.sender
.as_mut()
.expect("The sender should only be None after de-initialization.")
}
const fn new(
sender: mpsc::UnboundedSender<AsyncTaskMessage>,
receiver: mpsc::UnboundedReceiver<()>,
) -> Self {
Self {
sender: Some(sender),
receiver,
}
}
}
// #[gen_stub_pymethods]
#[pymethods]
impl PyAsyncTaskHandle {
#[new]
fn py_new(py: Python<'_>) -> PyResult<Self> {
use pyo3_async_runtimes::tokio::get_runtime;
// create communication channel TOWARDS our task
let (h_sender, t_receiver) = mpsc::unbounded_channel::<AsyncTaskMessage>();
// create communication channel FROM our task
let (t_sender, h_receiver) = mpsc::unbounded_channel::<()>();
// perform necessary setup within tokio context - or it crashes
let () = get_runtime().block_on(async { needs_tokio_runtime() });
// spawn tokio task with this thread's task-locals - without this, async callbacks on the new threads will not work!!
_ = get_runtime().spawn_with_scope(py, async move {
async_task(t_sender, t_receiver).await;
});
Ok(Self::new(h_sender, h_receiver))
}
/// NOTE: exceptions in callbacks are silently ignored until end of execution
fn add_sync_callback(
&self,
// #[gen_stub(override_type(
// type_repr="collections.abc.Callable[[], None]",
// imports=("collections.abc")
// ))]
callback: Py<PyAny>,
) -> PyResult<()> {
// blocking call to async method -> can do non-blocking if needed
self.sender()
.send(AsyncTaskMessage::SyncCallback(Box::new(move || {
_ = Python::with_gil(|py| callback.call0(py).write_unraisable_with(py));
})))
.pyerr()?;
Ok(())
}
/// NOTE: exceptions in callbacks are silently ignored until end of execution
fn add_async_callback(
&self,
// #[gen_stub(override_type(
// type_repr="collections.abc.Callable[[], collections.abc.Awaitable[None]]",
// imports=("collections.abc")
// ))]
callback: Py<PyAny>,
) -> PyResult<()> {
// blocking call to async method -> can do non-blocking if needed
self.sender()
.send(AsyncTaskMessage::AsyncCallback(Box::new(move || {
let c = Python::with_gil(|py| callback.clone_ref(py));
async move {
if let Some(f) = Python::with_gil(|py| {
let coroutine = c.call0(py).write_unraisable_with(py)?;
pyo3_async_runtimes::tokio::into_future(coroutine.into_bound(py))
.write_unraisable_with(py)
}) {
_ = f.await.write_unraisable();
}
}
.boxed()
})))
.pyerr()?;
Ok(())
}
async fn receive_unit(&mut self) -> PyResult<()> {
self.receiver
.recv()
.await
.ok_or(PyErr::new::<PyRuntimeError, _>(
"cannot receive unit on closed channel",
))
}
fn drain_units(&mut self) -> PyResult<i32> {
let mut cnt = 0;
loop {
match self.receiver.try_recv() {
Err(TryRecvError::Disconnected) => {
return Err(PyErr::new::<PyRuntimeError, _>(
"cannot receive unit on closed channel",
));
}
Err(TryRecvError::Empty) => return Ok(cnt),
Ok(()) => {
cnt += 1;
continue;
}
}
}
}
// #[gen_stub(skip)]
const fn __traverse__(&self, _visit: PyVisit<'_>) -> Result<(), PyTraverseError> {
Ok(()) // This is needed purely so `__clear__` can work
}
// #[gen_stub(skip)]
fn __clear__(&mut self) {
// TODO: may or may not need to await a "kill-signal" oneshot channel message,
// to ensure that the networking task is done BEFORE exiting the clear function...
// but this may require GIL?? and it may not be safe to call GIL here??
self.sender = None; // Using Option<T> as a trick to force `sender` channel to be dropped
}
}
pub fn examples_submodule(m: &Bound<'_, PyModule>) -> PyResult<()> {
m.add_class::<PyAsyncTaskHandle>()?;
Ok(())
}

View File

@@ -17,6 +17,7 @@
extern crate core;
mod allow_threading;
mod examples;
pub(crate) mod networking;
pub(crate) mod pylibp2p;
@@ -24,6 +25,7 @@ use crate::networking::networking_submodule;
use crate::pylibp2p::ident::ident_submodule;
use crate::pylibp2p::multiaddr::multiaddr_submodule;
use pyo3::prelude::PyModule;
use pyo3::prelude::*;
use pyo3::{Bound, PyResult, pyclass, pymodule};
use pyo3_stub_gen::define_stub_info_gatherer;
@@ -34,10 +36,14 @@ pub(crate) mod r#const {
/// Namespace for all the type/trait aliases used by this crate.
pub(crate) mod alias {
use std::error::Error;
use std::marker::Tuple;
pub trait SendFn<Args: Tuple + Send + 'static, Output> =
Fn<Args, Output = Output> + Send + 'static;
pub type AnyError = Box<dyn Error + Send + Sync + 'static>;
pub type AnyResult<T> = Result<T, AnyError>;
}
/// Namespace for crate-wide extension traits/methods
@@ -45,6 +51,7 @@ pub(crate) mod ext {
use crate::allow_threading::AllowThreads;
use extend::ext;
use pyo3::exceptions::{PyConnectionError, PyRuntimeError};
use pyo3::marker::Ungil;
use pyo3::types::PyBytes;
use pyo3::{Py, PyErr, PyResult, Python};
use tokio::runtime::Runtime;
@@ -55,7 +62,7 @@ pub(crate) mod ext {
#[ext(pub, name = ByteArrayExt)]
impl [u8] {
fn pybytes(&self) -> Py<PyBytes> {
Python::attach(|py| PyBytes::new(py, self).unbind())
Python::with_gil(|py| PyBytes::new(py, self).unbind())
}
}
@@ -91,7 +98,7 @@ pub(crate) mod ext {
#[ext(pub, name = PyResultExt)]
impl<T> PyResult<T> {
fn write_unraisable(self) -> Option<T> {
Python::attach(|py| self.write_unraisable_with(py))
Python::with_gil(|py| self.write_unraisable_with(py))
}
fn write_unraisable_with(self, py: Python<'_>) -> Option<T> {
@@ -168,6 +175,24 @@ pub(crate) mod ext {
}
}
pub(crate) mod private {
use std::marker::Sized;
/// Sealed traits support
pub trait Sealed {}
impl<T: ?Sized> Sealed for T {}
}
/// A wrapper around [`Py`] that implements [`Clone`] using [`Python::with_gil`].
#[repr(transparent)]
pub(crate) struct ClonePy<T>(pub Py<T>);
impl<T> Clone for ClonePy<T> {
fn clone(&self) -> Self {
Python::with_gil(|py| Self(self.0.clone_ref(py)))
}
}
/// A Python module implemented in Rust. The name of this function must match
/// the `lib.name` setting in the `Cargo.toml`, else Python will not be able to
/// import the module.

View File

@@ -11,9 +11,9 @@ use crate::ext::{ResultExt as _, TokioMpscReceiverExt as _, TokioMpscSenderExt a
use crate::pyclass;
use crate::pylibp2p::ident::{PyKeypair, PyPeerId};
use libp2p::futures::StreamExt as _;
use libp2p::gossipsub;
use libp2p::gossipsub::{IdentTopic, Message, MessageId, PublishError};
use libp2p::swarm::SwarmEvent;
use libp2p::{gossipsub, mdns};
use networking::discovery;
use networking::swarm::create_swarm;
use pyo3::prelude::{PyModule, PyModuleMethods as _};
@@ -25,7 +25,7 @@ use tokio::sync::{Mutex, mpsc, oneshot};
mod exception {
use pyo3::types::PyTuple;
use pyo3::{exceptions::PyException, prelude::*};
use pyo3::{PyErrArguments, exceptions::PyException, prelude::*};
use pyo3_stub_gen::derive::*;
#[gen_stub_pyclass]
@@ -155,6 +155,7 @@ async fn networking_task(
) {
use SwarmEvent::*;
use ToTask::*;
use mdns::Event::*;
use networking::swarm::BehaviourEvent::*;
log::info!("RUST: networking task started");
@@ -484,7 +485,7 @@ impl PyNetworkingHandle {
let (tx, rx) = oneshot::channel();
// send off request to subscribe
let data = Python::attach(|py| Vec::from(data.as_bytes(py)));
let data = Python::with_gil(|py| Vec::from(data.as_bytes(py)));
self.to_task_tx()
.send_py(ToTask::GossipsubPublish {
topic,

View File

@@ -19,6 +19,8 @@ either = { workspace = true }
# macro dependencies
extend = { workspace = true }
delegate = { workspace = true }
impl-trait-for-tuples = { workspace = true }
derive_more = { workspace = true }
# async
tokio = { workspace = true, features = ["full"] }
@@ -27,6 +29,11 @@ futures-timer = { workspace = true }
# utility dependencies
util = { workspace = true }
thiserror = { workspace = true }
#internment = { workspace = true }
#recursion = { workspace = true }
#generativity = { workspace = true }
#itertools = { workspace = true }
tracing-subscriber = { version = "0.3.19", features = ["default", "env-filter"] }
keccak-const = { workspace = true }
@@ -34,4 +41,4 @@ keccak-const = { workspace = true }
log = { workspace = true }
# networking
libp2p = { workspace = true, features = ["full"] }
libp2p = { workspace = true, features = ["full"] }

View File

@@ -24,8 +24,8 @@ use libp2p::{
swarm::{NetworkBehaviour, SwarmEvent},
tcp, yamux,
};
use std::error::Error;
use std::time::Duration;
use std::{error::Error, hash::Hash};
use tokio::{io, io::AsyncBufReadExt, select};
use tracing_subscriber::EnvFilter;

View File

@@ -1,4 +1,5 @@
use crate::ext::MultiaddrExt;
use crate::keep_alive;
use delegate::delegate;
use either::Either;
use futures::FutureExt;

View File

@@ -0,0 +1,44 @@
use delegate::delegate;
use libp2p::swarm::handler::ConnectionEvent;
use libp2p::swarm::{ConnectionHandlerEvent, SubstreamProtocol, dummy, handler};
use std::task::{Context, Poll};
/// An implementation of [`ConnectionHandler`] that doesn't handle any protocols, but it keeps
/// the connection alive.
#[derive(Clone)]
#[repr(transparent)]
pub struct ConnectionHandler(dummy::ConnectionHandler);
impl ConnectionHandler {
pub fn new() -> Self {
ConnectionHandler(dummy::ConnectionHandler)
}
}
impl handler::ConnectionHandler for ConnectionHandler {
// delegate types and implementation mostly to dummy handler
type FromBehaviour = <dummy::ConnectionHandler as handler::ConnectionHandler>::FromBehaviour;
type ToBehaviour = <dummy::ConnectionHandler as handler::ConnectionHandler>::ToBehaviour;
type InboundProtocol =
<dummy::ConnectionHandler as handler::ConnectionHandler>::InboundProtocol;
type OutboundProtocol =
<dummy::ConnectionHandler as handler::ConnectionHandler>::OutboundProtocol;
type InboundOpenInfo =
<dummy::ConnectionHandler as handler::ConnectionHandler>::InboundOpenInfo;
type OutboundOpenInfo =
<dummy::ConnectionHandler as handler::ConnectionHandler>::OutboundOpenInfo;
delegate! {
to self.0 {
fn listen_protocol(&self) -> SubstreamProtocol<Self::InboundProtocol, Self::InboundOpenInfo>;
fn poll(&mut self, cx: &mut Context<'_>) -> Poll<ConnectionHandlerEvent<Self::OutboundProtocol, Self::OutboundOpenInfo, Self::ToBehaviour>>;
fn on_behaviour_event(&mut self, event: Self::FromBehaviour);
fn on_connection_event(&mut self, event: ConnectionEvent<Self::InboundProtocol, Self::OutboundProtocol, Self::InboundOpenInfo, Self::OutboundOpenInfo>);
}
}
// specifically override this to force connection to stay alive
fn connection_keep_alive(&self) -> bool {
true
}
}

View File

@@ -3,7 +3,19 @@
//! this is here as a placeholder documentation
//!
//!
// enable Rust-unstable features for convenience
#![feature(trait_alias)]
// #![feature(stmt_expr_attributes)]
// #![feature(unboxed_closures)]
// #![feature(assert_matches)]
// #![feature(async_fn_in_dyn_trait)]
// #![feature(async_for_loop)]
// #![feature(auto_traits)]
// #![feature(negative_impls)]
pub mod discovery;
pub mod keep_alive;
pub mod swarm;
/// Namespace for all the type/trait aliases used by this crate.
@@ -42,3 +54,11 @@ pub(crate) mod ext {
}
}
}
pub(crate) mod private {
#![allow(dead_code)]
/// Sealed traits support
pub trait Sealed {}
impl<T: ?Sized> Sealed for T {}
}

View File

@@ -14,7 +14,6 @@ from exo.download.download_utils import (
map_repo_download_progress_to_download_progress_data,
)
from exo.download.shard_downloader import ShardDownloader
from exo.shared.constants import EXO_MODELS_DIR
from exo.shared.models.model_cards import ModelId
from exo.shared.types.commands import (
CancelDownload,
@@ -64,9 +63,6 @@ class DownloadCoordinator:
self.event_sender, self.event_receiver = channel[Event]()
self.shard_downloader.on_progress(self._download_progress_callback)
def _model_dir(self, model_id: ModelId) -> str:
return str(EXO_MODELS_DIR / model_id.normalize())
async def _download_progress_callback(
self, callback_shard: ShardMetadata, progress: RepoDownloadProgress
) -> None:
@@ -78,7 +74,6 @@ class DownloadCoordinator:
shard_metadata=callback_shard,
node_id=self.node_id,
total_bytes=progress.total_bytes,
model_directory=self._model_dir(model_id),
)
self.download_status[model_id] = completed
await self.event_sender.send(
@@ -98,7 +93,6 @@ class DownloadCoordinator:
download_progress=map_repo_download_progress_to_download_progress_data(
progress
),
model_directory=self._model_dir(model_id),
)
self.download_status[model_id] = ongoing
await self.event_sender.send(
@@ -176,11 +170,7 @@ class DownloadCoordinator:
return
# Emit pending status
progress = DownloadPending(
shard_metadata=shard,
node_id=self.node_id,
model_directory=self._model_dir(model_id),
)
progress = DownloadPending(shard_metadata=shard, node_id=self.node_id)
self.download_status[model_id] = progress
await self.event_sender.send(NodeDownloadProgress(download_progress=progress))
@@ -194,7 +184,6 @@ class DownloadCoordinator:
shard_metadata=shard,
node_id=self.node_id,
total_bytes=initial_progress.total_bytes,
model_directory=self._model_dir(model_id),
)
self.download_status[model_id] = completed
await self.event_sender.send(
@@ -217,7 +206,6 @@ class DownloadCoordinator:
download_progress=map_repo_download_progress_to_download_progress_data(
initial_progress
),
model_directory=self._model_dir(model_id),
)
self.download_status[model_id] = status
self.event_sender.send_nowait(NodeDownloadProgress(download_progress=status))
@@ -231,7 +219,6 @@ class DownloadCoordinator:
shard_metadata=shard,
node_id=self.node_id,
error_message=str(e),
model_directory=self._model_dir(model_id),
)
self.download_status[model_id] = failed
await self.event_sender.send(
@@ -266,7 +253,6 @@ class DownloadCoordinator:
pending = DownloadPending(
shard_metadata=current_status.shard_metadata,
node_id=self.node_id,
model_directory=self._model_dir(model_id),
)
await self.event_sender.send(
NodeDownloadProgress(download_progress=pending)
@@ -309,18 +295,11 @@ class DownloadCoordinator:
node_id=self.node_id,
shard_metadata=progress.shard,
total_bytes=progress.total_bytes,
model_directory=self._model_dir(
progress.shard.model_card.model_id
),
)
elif progress.status in ["in_progress", "not_started"]:
if progress.downloaded_bytes_this_session.in_bytes == 0:
status = DownloadPending(
node_id=self.node_id,
shard_metadata=progress.shard,
model_directory=self._model_dir(
progress.shard.model_card.model_id
),
node_id=self.node_id, shard_metadata=progress.shard
)
else:
status = DownloadOngoing(
@@ -329,9 +308,6 @@ class DownloadCoordinator:
download_progress=map_repo_download_progress_to_download_progress_data(
progress
),
model_directory=self._model_dir(
progress.shard.model_card.model_id
),
)
else:
continue

View File

@@ -136,8 +136,6 @@ class Node:
async def run(self):
async with self._tg as tg:
signal.signal(signal.SIGINT, lambda _, __: self.shutdown())
signal.signal(signal.SIGTERM, lambda _, __: self.shutdown())
tg.start_soon(self.router.run)
tg.start_soon(self.election.run)
if self.download_coordinator:
@@ -149,6 +147,8 @@ class Node:
if self.api:
tg.start_soon(self.api.run)
tg.start_soon(self._elect_loop)
signal.signal(signal.SIGINT, lambda _, __: self.shutdown())
signal.signal(signal.SIGTERM, lambda _, __: self.shutdown())
def shutdown(self):
# if this is our second call to shutdown, just sys.exit

View File

@@ -17,7 +17,6 @@ from exo.shared.types.api import (
LogprobsContentItem,
StreamingChoiceResponse,
ToolCall,
Usage,
)
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.common import CommandId
@@ -126,8 +125,6 @@ async def generate_chat_stream(
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> AsyncGenerator[str, None]:
"""Generate Chat Completions API streaming events from chunks."""
last_usage: Usage | None = None
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
error_response = ErrorResponse(
@@ -141,8 +138,6 @@ async def generate_chat_stream(
yield "data: [DONE]\n\n"
return
last_usage = chunk.usage or last_usage
if isinstance(chunk, ToolCallChunk):
tool_call_deltas = [
ToolCall(
@@ -166,15 +161,12 @@ async def generate_chat_stream(
finish_reason="tool_calls",
)
],
usage=last_usage,
)
yield f"data: {tool_response.model_dump_json()}\n\n"
yield "data: [DONE]\n\n"
return
chunk_response = chunk_to_response(chunk, command_id)
if chunk.finish_reason is not None:
chunk_response = chunk_response.model_copy(update={"usage": last_usage})
yield f"data: {chunk_response.model_dump_json()}\n\n"
if chunk.finish_reason is not None:
@@ -184,9 +176,7 @@ async def generate_chat_stream(
async def collect_chat_response(
command_id: CommandId,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> AsyncGenerator[str]:
# This is an AsyncGenerator[str] rather than returning a ChatCompletionResponse because
# FastAPI handles the cancellation better but wouldn't auto-serialize for some reason
) -> ChatCompletionResponse:
"""Collect all token chunks and return a single ChatCompletionResponse."""
text_parts: list[str] = []
tool_calls: list[ToolCall] = []
@@ -194,7 +184,6 @@ async def collect_chat_response(
model: str | None = None
finish_reason: FinishReason | None = None
error_message: str | None = None
last_usage: Usage | None = None
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
@@ -204,8 +193,6 @@ async def collect_chat_response(
if model is None:
model = chunk.model
last_usage = chunk.usage or last_usage
if isinstance(chunk, TokenChunk):
text_parts.append(chunk.text)
if chunk.logprob is not None:
@@ -236,7 +223,7 @@ async def collect_chat_response(
combined_text = "".join(text_parts)
assert model is not None
yield ChatCompletionResponse(
return ChatCompletionResponse(
id=command_id,
created=int(time.time()),
model=model,
@@ -254,6 +241,4 @@ async def collect_chat_response(
finish_reason=finish_reason,
)
],
usage=last_usage,
).model_dump_json()
return
)

View File

@@ -4,7 +4,7 @@ import json
from collections.abc import AsyncGenerator
from typing import Any
from exo.shared.types.api import FinishReason, Usage
from exo.shared.types.api import FinishReason
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.claude_api import (
ClaudeContentBlock,
@@ -161,14 +161,12 @@ async def collect_claude_response(
command_id: CommandId,
model: str,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> AsyncGenerator[str]:
# This is an AsyncGenerator[str] rather than returning a ChatCompletionResponse because
# FastAPI handles the cancellation better but wouldn't auto-serialize for some reason
) -> ClaudeMessagesResponse:
"""Collect all token chunks and return a single ClaudeMessagesResponse."""
text_parts: list[str] = []
tool_use_blocks: list[ClaudeToolUseBlock] = []
stop_reason: ClaudeStopReason | None = None
last_usage: Usage | None = None
last_stats = None
error_message: str | None = None
async for chunk in chunk_stream:
@@ -176,8 +174,6 @@ async def collect_claude_response(
error_message = chunk.error_message or "Internal server error"
break
last_usage = chunk.usage or last_usage
if isinstance(chunk, ToolCallChunk):
for tool in chunk.tool_calls:
tool_use_blocks.append(
@@ -187,10 +183,12 @@ async def collect_claude_response(
input=json.loads(tool.arguments), # pyright: ignore[reportAny]
)
)
last_stats = chunk.stats or last_stats
stop_reason = "tool_use"
continue
text_parts.append(chunk.text)
last_stats = chunk.stats or last_stats
if chunk.finish_reason is not None:
stop_reason = finish_reason_to_claude_stop_reason(chunk.finish_reason)
@@ -210,11 +208,11 @@ async def collect_claude_response(
if not content:
content.append(ClaudeTextBlock(text=""))
# Use actual usage data if available
input_tokens = last_usage.prompt_tokens if last_usage else 0
output_tokens = last_usage.completion_tokens if last_usage else 0
# Use actual usage data from stats if available
input_tokens = last_stats.prompt_tokens if last_stats else 0
output_tokens = last_stats.generation_tokens if last_stats else 0
yield ClaudeMessagesResponse(
return ClaudeMessagesResponse(
id=f"msg_{command_id}",
model=model,
content=content,
@@ -223,8 +221,7 @@ async def collect_claude_response(
input_tokens=input_tokens,
output_tokens=output_tokens,
),
).model_dump_json()
return
)
async def generate_claude_stream(
@@ -252,7 +249,7 @@ async def generate_claude_stream(
output_tokens = 0
stop_reason: ClaudeStopReason | None = None
last_usage: Usage | None = None
last_stats = None
next_block_index = 1 # text block is 0, tool blocks start at 1
async for chunk in chunk_stream:
@@ -260,9 +257,8 @@ async def generate_claude_stream(
# Close text block and bail
break
last_usage = chunk.usage or last_usage
if isinstance(chunk, ToolCallChunk):
last_stats = chunk.stats or last_stats
stop_reason = "tool_use"
# Emit tool_use content blocks
@@ -294,6 +290,7 @@ async def generate_claude_stream(
continue
output_tokens += 1 # Count each chunk as one token
last_stats = chunk.stats or last_stats
# content_block_delta
delta_event = ClaudeContentBlockDeltaEvent(
@@ -305,9 +302,9 @@ async def generate_claude_stream(
if chunk.finish_reason is not None:
stop_reason = finish_reason_to_claude_stop_reason(chunk.finish_reason)
# Use actual token count from usage if available
if last_usage is not None:
output_tokens = last_usage.completion_tokens
# Use actual token count from stats if available
if last_stats is not None:
output_tokens = last_stats.generation_tokens
# content_block_stop for text block
block_stop = ClaudeContentBlockStopEvent(index=0)

View File

@@ -4,7 +4,6 @@ from collections.abc import AsyncGenerator
from itertools import count
from typing import Any
from exo.shared.types.api import Usage
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.common import CommandId
from exo.shared.types.openai_responses import (
@@ -122,15 +121,13 @@ async def collect_responses_response(
command_id: CommandId,
model: str,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> AsyncGenerator[str]:
# This is an AsyncGenerator[str] rather than returning a ChatCompletionResponse because
# FastAPI handles the cancellation better but wouldn't auto-serialize for some reason
) -> ResponsesResponse:
"""Collect all token chunks and return a single ResponsesResponse."""
response_id = f"resp_{command_id}"
item_id = f"item_{command_id}"
accumulated_text = ""
function_call_items: list[ResponseFunctionCallItem] = []
last_usage: Usage | None = None
last_stats = None
error_message: str | None = None
async for chunk in chunk_stream:
@@ -138,32 +135,32 @@ async def collect_responses_response(
error_message = chunk.error_message or "Internal server error"
break
last_usage = chunk.usage or last_usage
if isinstance(chunk, ToolCallChunk):
for tool in chunk.tool_calls:
function_call_items.append(
ResponseFunctionCallItem(
id=tool.id,
call_id=tool.id,
id=f"fc_{tool.id}",
call_id=f"call_{tool.id}",
name=tool.name,
arguments=tool.arguments,
)
)
last_stats = chunk.stats or last_stats
continue
accumulated_text += chunk.text
last_stats = chunk.stats or last_stats
if error_message is not None:
raise ValueError(error_message)
# Create usage from usage data if available
# Create usage from stats if available
usage = None
if last_usage is not None:
if last_stats is not None:
usage = ResponseUsage(
input_tokens=last_usage.prompt_tokens,
output_tokens=last_usage.completion_tokens,
total_tokens=last_usage.total_tokens,
input_tokens=last_stats.prompt_tokens,
output_tokens=last_stats.generation_tokens,
total_tokens=last_stats.prompt_tokens + last_stats.generation_tokens,
)
output: list[ResponseItem] = [
@@ -175,15 +172,14 @@ async def collect_responses_response(
]
output.extend(function_call_items)
yield ResponsesResponse(
return ResponsesResponse(
id=response_id,
model=model,
status="completed",
output=output,
output_text=accumulated_text,
usage=usage,
).model_dump_json()
return
)
async def generate_responses_stream(
@@ -239,16 +235,15 @@ async def generate_responses_stream(
accumulated_text = ""
function_call_items: list[ResponseFunctionCallItem] = []
last_usage: Usage | None = None
last_stats = None
next_output_index = 1 # message item is at 0
async for chunk in chunk_stream:
if isinstance(chunk, ErrorChunk):
break
last_usage = chunk.usage or last_usage
if isinstance(chunk, ToolCallChunk):
last_stats = chunk.stats or last_stats
for tool in chunk.tool_calls:
fc_id = f"fc_{tool.id}"
call_id = f"call_{tool.id}"
@@ -307,6 +302,7 @@ async def generate_responses_stream(
continue
accumulated_text += chunk.text
last_stats = chunk.stats or last_stats
# response.output_text.delta
delta_event = ResponseTextDeltaEvent(
@@ -350,13 +346,13 @@ async def generate_responses_stream(
)
yield f"event: response.output_item.done\ndata: {item_done.model_dump_json()}\n\n"
# Create usage from usage data if available
# Create usage from stats if available
usage = None
if last_usage is not None:
if last_stats is not None:
usage = ResponseUsage(
input_tokens=last_usage.prompt_tokens,
output_tokens=last_usage.completion_tokens,
total_tokens=last_usage.total_tokens,
input_tokens=last_stats.prompt_tokens,
output_tokens=last_stats.generation_tokens,
total_tokens=last_stats.prompt_tokens + last_stats.generation_tokens,
)
# response.completed
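
One detail worth noting about this stats-based mapping (and its counterparts in the chat and Claude adapters): GenerationStats appears to expose prompt_tokens and generation_tokens but no total, so total_tokens is reconstructed as their sum; e.g. prompt_tokens=120 and generation_tokens=48 yields input 120, output 48, total 168.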

View File

@@ -125,7 +125,6 @@ from exo.shared.types.commands import (
PlaceInstance,
SendInputChunk,
StartDownload,
TaskCancelled,
TaskFinished,
TextGeneration,
)
@@ -541,14 +540,16 @@ class API:
break
except anyio.get_cancelled_exc_class():
command = TaskCancelled(cancelled_command_id=command_id)
with anyio.CancelScope(shield=True):
await self.command_sender.send(
ForwarderCommand(origin=self.node_id, command=command)
)
# TODO: TaskCancelled
"""
self.command_sender.send_nowait(
ForwarderCommand(origin=self.node_id, command=command)
)
"""
raise
finally:
await self._send(TaskFinished(finished_command_id=command_id))
command = TaskFinished(finished_command_id=command_id)
await self._send(command)
if command_id in self._text_generation_queues:
del self._text_generation_queues[command_id]
@@ -643,14 +644,11 @@ class API:
"X-Accel-Buffering": "no",
},
)
else:
return StreamingResponse(
collect_chat_response(
command.command_id,
self._token_chunk_stream(command.command_id),
),
media_type="application/json",
)
return await collect_chat_response(
command.command_id,
self._token_chunk_stream(command.command_id),
)
async def bench_chat_completions(
self, payload: BenchChatCompletionRequest
@@ -666,7 +664,8 @@ class API:
command = TextGeneration(task_params=task_params)
await self._send(command)
return await self._collect_text_generation_with_stats(command.command_id)
response = await self._collect_text_generation_with_stats(command.command_id)
return response
async def _resolve_and_validate_text_model(self, model_id: ModelId) -> ModelId:
"""Validate a text model exists and return the resolved model ID.
@@ -884,11 +883,6 @@ class API:
del image_metadata[key]
except anyio.get_cancelled_exc_class():
command = TaskCancelled(cancelled_command_id=command_id)
with anyio.CancelScope(shield=True):
await self.command_sender.send(
ForwarderCommand(origin=self.node_id, command=command)
)
raise
finally:
await self._send(TaskFinished(finished_command_id=command_id))
@@ -970,11 +964,6 @@ class API:
return (images, stats if capture_stats else None)
except anyio.get_cancelled_exc_class():
command = TaskCancelled(cancelled_command_id=command_id)
with anyio.CancelScope(shield=True):
await self.command_sender.send(
ForwarderCommand(origin=self.node_id, command=command)
)
raise
finally:
await self._send(TaskFinished(finished_command_id=command_id))
@@ -1232,15 +1221,12 @@ class API:
"X-Accel-Buffering": "no",
},
)
else:
return StreamingResponse(
collect_claude_response(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
),
media_type="application/json",
)
return await collect_claude_response(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
)
async def openai_responses(
self, payload: ResponsesRequest
@@ -1268,15 +1254,11 @@ class API:
},
)
else:
return StreamingResponse(
collect_responses_response(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
),
media_type="application/json",
)
return await collect_responses_response(
command.command_id,
payload.model,
self._token_chunk_stream(command.command_id),
)
def _calculate_total_available_memory(self) -> Memory:
"""Calculate total available memory across all nodes in bytes."""

View File

@@ -24,7 +24,6 @@ from exo.shared.types.commands import (
PlaceInstance,
RequestEventLog,
SendInputChunk,
TaskCancelled,
TaskFinished,
TestCommand,
TextGeneration,
@@ -40,7 +39,6 @@ from exo.shared.types.events import (
NodeTimedOut,
TaskCreated,
TaskDeleted,
TaskStatusUpdated,
TraceEventData,
TracesCollected,
TracesMerged,
@@ -281,7 +279,7 @@ class Master:
case DeleteInstance():
placement = delete_instance(command, self.state.instances)
transition_events = get_transition_events(
self.state.instances, placement, self.state.tasks
self.state.instances, placement
)
for cmd in cancel_unnecessary_downloads(
placement, self.state.downloads
@@ -301,7 +299,7 @@ class Master:
self.state.node_network,
)
transition_events = get_transition_events(
self.state.instances, placement, self.state.tasks
self.state.instances, placement
)
generated_events.extend(transition_events)
case CreateInstance():
@@ -311,7 +309,7 @@ class Master:
self.state.instances,
)
transition_events = get_transition_events(
self.state.instances, placement, self.state.tasks
self.state.instances, placement
)
generated_events.extend(transition_events)
case SendInputChunk(chunk=chunk):
@@ -321,18 +319,6 @@ class Master:
chunk=chunk,
)
)
case TaskCancelled():
if (
task_id := self.command_task_mapping.get(
command.cancelled_command_id
)
) is not None:
generated_events.append(
TaskStatusUpdated(
task_status=TaskStatus.Cancelled,
task_id=task_id,
)
)
case TaskFinished():
generated_events.append(
TaskDeleted(
@@ -341,9 +327,10 @@ class Master:
]
)
)
self.command_task_mapping.pop(
command.finished_command_id, None
)
if command.finished_command_id in self.command_task_mapping:
del self.command_task_mapping[
command.finished_command_id
]
case RequestEventLog():
# We should just be able to send everything, since other buffers will ignore old messages
# rate limit to 1000 at a time

View File

@@ -22,15 +22,9 @@ from exo.shared.types.commands import (
PlaceInstance,
)
from exo.shared.types.common import NodeId
from exo.shared.types.events import (
Event,
InstanceCreated,
InstanceDeleted,
TaskStatusUpdated,
)
from exo.shared.types.events import Event, InstanceCreated, InstanceDeleted
from exo.shared.types.memory import Memory
from exo.shared.types.profiling import MemoryUsage, NodeNetworkInfo
from exo.shared.types.tasks import Task, TaskId, TaskStatus
from exo.shared.types.worker.downloads import (
DownloadOngoing,
DownloadProgress,
@@ -192,7 +186,6 @@ def delete_instance(
def get_transition_events(
current_instances: Mapping[InstanceId, Instance],
target_instances: Mapping[InstanceId, Instance],
tasks: Mapping[TaskId, Task],
) -> Sequence[Event]:
events: list[Event] = []
@@ -208,18 +201,6 @@ def get_transition_events(
# find instances to delete
for instance_id in current_instances:
if instance_id not in target_instances:
for task in tasks.values():
if task.instance_id == instance_id and task.task_status in [
TaskStatus.Pending,
TaskStatus.Running,
]:
events.append(
TaskStatusUpdated(
task_status=TaskStatus.Cancelled,
task_id=task.task_id,
)
)
events.append(
InstanceDeleted(
instance_id=instance_id,

View File

@@ -4,11 +4,7 @@ import json
from collections.abc import AsyncGenerator
from typing import Any, cast
from exo.master.adapters.claude import (
ClaudeMessagesResponse,
collect_claude_response,
generate_claude_stream,
)
from exo.master.adapters.claude import collect_claude_response, generate_claude_stream
from exo.shared.types.api import ToolCallItem
from exo.shared.types.chunks import ErrorChunk, TokenChunk, ToolCallChunk
from exo.shared.types.common import CommandId, ModelId
@@ -21,18 +17,6 @@ async def _chunks_to_stream(
yield chunk
async def _collect_response(
command_id: CommandId,
model: str,
chunk_stream: AsyncGenerator[ErrorChunk | ToolCallChunk | TokenChunk, None],
) -> ClaudeMessagesResponse:
"""Helper to consume the async generator and parse the JSON response."""
parts: list[str] = []
async for part in collect_claude_response(command_id, model, chunk_stream):
parts.append(part)
return ClaudeMessagesResponse.model_validate_json("".join(parts))
MODEL = ModelId("test-model")
COMMAND_ID = CommandId("cmd_test123")
@@ -63,7 +47,7 @@ class TestCollectClaudeResponseToolUse:
],
),
]
response = await _collect_response(
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
@@ -93,7 +77,7 @@ class TestCollectClaudeResponseToolUse:
],
),
]
response = await _collect_response(
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
@@ -118,7 +102,7 @@ class TestCollectClaudeResponseToolUse:
],
),
]
response = await _collect_response(
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
@@ -132,7 +116,7 @@ class TestCollectClaudeResponseToolUse:
async def test_no_content_produces_empty_text_block(self):
chunks: list[ErrorChunk | ToolCallChunk | TokenChunk] = []
response = await _collect_response(
response = await collect_claude_response(
COMMAND_ID, "test-model", _chunks_to_stream(chunks)
)
assert len(response.content) == 1

View File

@@ -239,7 +239,7 @@ def test_get_transition_events_no_change(instance: Instance):
target_instances = {instance_id: instance}
# act
events = get_transition_events(current_instances, target_instances, {})
events = get_transition_events(current_instances, target_instances)
# assert
assert len(events) == 0
@@ -252,7 +252,7 @@ def test_get_transition_events_create_instance(instance: Instance):
target_instances: dict[InstanceId, Instance] = {instance_id: instance}
# act
events = get_transition_events(current_instances, target_instances, {})
events = get_transition_events(current_instances, target_instances)
# assert
assert len(events) == 1
@@ -266,7 +266,7 @@ def test_get_transition_events_delete_instance(instance: Instance):
target_instances: dict[InstanceId, Instance] = {}
# act
events = get_transition_events(current_instances, target_instances, {})
events = get_transition_events(current_instances, target_instances)
# assert
assert len(events) == 1

View File

@@ -184,10 +184,19 @@ def apply_instance_created(event: InstanceCreated, state: State) -> State:
def apply_instance_deleted(event: InstanceDeleted, state: State) -> State:
deleted_instance = state.instances.get(event.instance_id)
new_instances: Mapping[InstanceId, Instance] = {
iid: inst for iid, inst in state.instances.items() if iid != event.instance_id
}
return state.model_copy(update={"instances": new_instances})
runner_ids_to_remove: set[RunnerId] = set()
if deleted_instance is not None:
runner_ids_to_remove = set(
deleted_instance.shard_assignments.runner_to_shard.keys()
)
new_runners: Mapping[RunnerId, RunnerStatus] = {
rid: rs for rid, rs in state.runners.items() if rid not in runner_ids_to_remove
}
return state.model_copy(update={"instances": new_instances, "runners": new_runners})
def apply_runner_status_updated(event: RunnerStatusUpdated, state: State) -> State:
@@ -218,6 +227,11 @@ def apply_node_timed_out(event: NodeTimedOut, state: State) -> State:
key: value for key, value in state.downloads.items() if key != event.node_id
}
# Clean up all granular node mappings
node_identities = {
key: value
for key, value in state.node_identities.items()
if key != event.node_id
}
node_memory = {
key: value for key, value in state.node_memory.items() if key != event.node_id
}
@@ -258,6 +272,7 @@ def apply_node_timed_out(event: NodeTimedOut, state: State) -> State:
"downloads": downloads,
"topology": topology,
"last_seen": last_seen,
"node_identities": node_identities,
"node_memory": node_memory,
"node_disk": node_disk,
"node_system": node_system,

View File

@@ -0,0 +1,142 @@
from exo.shared.apply import apply_instance_deleted
from exo.shared.models.model_cards import ModelId
from exo.shared.tests.conftest import get_pipeline_shard_metadata
from exo.shared.types.common import NodeId
from exo.shared.types.events import InstanceDeleted
from exo.shared.types.state import State
from exo.shared.types.worker.instances import InstanceId, MlxRingInstance
from exo.shared.types.worker.runners import (
RunnerId,
RunnerReady,
ShardAssignments,
)
from exo.shared.types.worker.shards import ShardMetadata
from exo.worker.tests.constants import (
INSTANCE_1_ID,
INSTANCE_2_ID,
MODEL_A_ID,
MODEL_B_ID,
NODE_A,
NODE_B,
RUNNER_1_ID,
RUNNER_2_ID,
)
def _make_instance(
instance_id: InstanceId,
model_id: ModelId,
node_to_runner: dict[NodeId, RunnerId],
runner_to_shard: dict[RunnerId, ShardMetadata],
) -> MlxRingInstance:
return MlxRingInstance(
instance_id=instance_id,
shard_assignments=ShardAssignments(
model_id=model_id,
node_to_runner=node_to_runner,
runner_to_shard=runner_to_shard,
),
hosts_by_node={},
ephemeral_port=50000,
)
def test_instance_deleted_removes_runners():
"""Deleting an instance must also remove its runner entries from state."""
shard = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=0)
instance = _make_instance(
INSTANCE_1_ID,
MODEL_A_ID,
{NODE_A: RUNNER_1_ID},
{RUNNER_1_ID: shard},
)
state = State(
instances={INSTANCE_1_ID: instance},
runners={RUNNER_1_ID: RunnerReady()},
)
new_state = apply_instance_deleted(
InstanceDeleted(instance_id=INSTANCE_1_ID), state
)
assert INSTANCE_1_ID not in new_state.instances
assert RUNNER_1_ID not in new_state.runners
def test_instance_deleted_removes_only_its_runners():
"""Deleting one instance must not remove runners belonging to another."""
shard_a = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=0)
shard_b = get_pipeline_shard_metadata(MODEL_B_ID, device_rank=0)
instance_1 = _make_instance(
INSTANCE_1_ID,
MODEL_A_ID,
{NODE_A: RUNNER_1_ID},
{RUNNER_1_ID: shard_a},
)
instance_2 = _make_instance(
INSTANCE_2_ID,
MODEL_B_ID,
{NODE_B: RUNNER_2_ID},
{RUNNER_2_ID: shard_b},
)
state = State(
instances={INSTANCE_1_ID: instance_1, INSTANCE_2_ID: instance_2},
runners={RUNNER_1_ID: RunnerReady(), RUNNER_2_ID: RunnerReady()},
)
new_state = apply_instance_deleted(
InstanceDeleted(instance_id=INSTANCE_1_ID), state
)
assert INSTANCE_1_ID not in new_state.instances
assert RUNNER_1_ID not in new_state.runners
# Instance 2 and its runner must remain
assert INSTANCE_2_ID in new_state.instances
assert RUNNER_2_ID in new_state.runners
def test_instance_deleted_multi_node_removes_all_runners():
"""Deleting a multi-node instance removes all of its runners."""
shard1 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=0, world_size=2)
shard2 = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=1, world_size=2)
instance = _make_instance(
INSTANCE_1_ID,
MODEL_A_ID,
{NODE_A: RUNNER_1_ID, NODE_B: RUNNER_2_ID},
{RUNNER_1_ID: shard1, RUNNER_2_ID: shard2},
)
state = State(
instances={INSTANCE_1_ID: instance},
runners={RUNNER_1_ID: RunnerReady(), RUNNER_2_ID: RunnerReady()},
)
new_state = apply_instance_deleted(
InstanceDeleted(instance_id=INSTANCE_1_ID), state
)
assert INSTANCE_1_ID not in new_state.instances
assert RUNNER_1_ID not in new_state.runners
assert RUNNER_2_ID not in new_state.runners
assert len(new_state.runners) == 0
def test_instance_deleted_unknown_id_is_noop_for_runners():
"""Deleting a non-existent instance should not affect runners."""
shard = get_pipeline_shard_metadata(MODEL_A_ID, device_rank=0)
instance = _make_instance(
INSTANCE_1_ID,
MODEL_A_ID,
{NODE_A: RUNNER_1_ID},
{RUNNER_1_ID: shard},
)
unknown_id = InstanceId("99999999-9999-4999-8999-999999999999")
state = State(
instances={INSTANCE_1_ID: instance},
runners={RUNNER_1_ID: RunnerReady()},
)
new_state = apply_instance_deleted(InstanceDeleted(instance_id=unknown_id), state)
# Everything should remain untouched
assert INSTANCE_1_ID in new_state.instances
assert RUNNER_1_ID in new_state.runners

View File

@@ -3,7 +3,8 @@ from collections.abc import Generator
from typing import Annotated, Any, Literal
from uuid import uuid4
from pydantic import BaseModel, Field
from pydantic import BaseModel, Field, field_validator
from pydantic_core import PydanticUseDefault
from exo.shared.models.model_cards import ModelCard, ModelId
from exo.shared.types.common import CommandId, NodeId
@@ -227,6 +228,13 @@ class PlaceInstanceParams(BaseModel):
instance_meta: InstanceMeta = InstanceMeta.MlxRing
min_nodes: int = 1
@field_validator("sharding", "instance_meta", mode="plain")
@classmethod
def use_default(cls, v: object):
if not v or not isinstance(v, (Sharding, InstanceMeta)):
raise PydanticUseDefault()
return v
class CreateInstanceParams(BaseModel):
instance: Instance
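
The plain-mode validator above silently falls back to the declared default whenever the incoming value is empty or of the wrong type. A minimal, self-contained toy model (not part of this codebase) showing how PydanticUseDefault produces that behaviour:

from pydantic import BaseModel, field_validator
from pydantic_core import PydanticUseDefault


class ToyParams(BaseModel):
    mode: str = "mlx-ring"

    @field_validator("mode", mode="plain")
    @classmethod
    def use_default(cls, v: object):
        # Empty or wrongly-typed input falls back to the field default
        if not v or not isinstance(v, str):
            raise PydanticUseDefault()
        return v


assert ToyParams(mode=None).mode == "mlx-ring"    # invalid -> default restored
assert ToyParams(mode="custom").mode == "custom"  # valid values pass through unchanged

For PlaceInstanceParams this means malformed "sharding" or "instance_meta" values degrade to the defaults instead of rejecting the whole placement request.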

View File

@@ -48,10 +48,6 @@ class DeleteInstance(BaseCommand):
instance_id: InstanceId
class TaskCancelled(BaseCommand):
cancelled_command_id: CommandId
class TaskFinished(BaseCommand):
finished_command_id: CommandId
@@ -93,7 +89,6 @@ Command = (
| PlaceInstance
| CreateInstance
| DeleteInstance
| TaskCancelled
| TaskFinished
| SendInputChunk
)

View File

@@ -24,7 +24,6 @@ class TaskStatus(str, Enum):
Complete = "Complete"
TimedOut = "TimedOut"
Failed = "Failed"
Cancelled = "Cancelled"
class BaseTask(TaggedModel):
@@ -61,11 +60,6 @@ class TextGeneration(BaseTask): # emitted by Master
error_message: str | None = Field(default=None)
class CancelTask(BaseTask):
cancelled_task_id: TaskId
runner_id: RunnerId
class ImageGeneration(BaseTask): # emitted by Master
command_id: CommandId
task_params: ImageGenerationTaskParams
@@ -93,7 +87,6 @@ Task = (
| LoadModel
| StartWarmup
| TextGeneration
| CancelTask
| ImageGeneration
| ImageEdits
| Shutdown

View File

@@ -26,7 +26,6 @@ class DownloadProgressData(CamelCaseModel):
class BaseDownloadProgress(TaggedModel):
node_id: NodeId
shard_metadata: ShardMetadata
model_directory: str = ""
class DownloadPending(BaseDownloadProgress):

View File

@@ -62,7 +62,6 @@ class PartialImageResponse(BaseRunnerResponse):
class ToolCallResponse(BaseRunnerResponse):
tool_calls: list[ToolCallItem]
usage: Usage | None
stats: GenerationStats | None = None
class FinishedResponse(BaseRunnerResponse):

View File

@@ -50,7 +50,9 @@ class RunnerReady(BaseRunnerStatus):
class RunnerRunning(BaseRunnerStatus):
pass
"""Runner is processing requests and can accept more (continuous batching)."""
active_requests: int = 0
class RunnerShuttingDown(BaseRunnerStatus):

View File

@@ -1,7 +1,5 @@
import sys
def print_startup_banner(port: int) -> None:
"""Print a prominent startup banner with API endpoint information."""
dashboard_url = f"http://localhost:{port}"
banner = f"""
╔═══════════════════════════════════════════════════════════════════════╗
@@ -29,4 +27,4 @@ def print_startup_banner(port: int) -> None:
"""
print(banner, file=sys.stderr)
print(banner)

View File

@@ -125,9 +125,7 @@ class MpSender[T]:
self._state.buffer.put(item, block=True)
async def send_async(self, item: T) -> None:
await to_thread.run_sync(
self.send, item, limiter=CapacityLimiter(1), abandon_on_cancel=True
)
await to_thread.run_sync(self.send, item, limiter=CapacityLimiter(1))
def close(self) -> None:
if not self._state.closed.is_set():

View File

@@ -0,0 +1,317 @@
"""Batch generation engine using mlx_lm's BatchGenerator for continuous batching."""
import time
from dataclasses import dataclass, field
from typing import get_args
import mlx.core as mx
from mlx_lm.generate import BatchGenerator
from mlx_lm.sample_utils import make_sampler
from mlx_lm.tokenizer_utils import StreamingDetokenizer, TokenizerWrapper
from exo.shared.types.api import FinishReason, GenerationStats
from exo.shared.types.common import CommandId
from exo.shared.types.memory import Memory
from exo.shared.types.tasks import TaskId
from exo.shared.types.text_generation import TextGenerationTaskParams
from exo.shared.types.worker.runner_response import GenerationResponse
from exo.worker.engines.mlx import Model
from exo.worker.engines.mlx.constants import MAX_TOKENS
from exo.worker.engines.mlx.generator.distributed_sync import share_object
from exo.worker.engines.mlx.utils_mlx import apply_chat_template
from exo.worker.runner.bootstrap import logger
@dataclass
class PendingInsert:
"""Pre-tokenized request ready for batch insertion."""
command_id: CommandId
task_id: TaskId
tokens: list[int]
max_tokens: int
prompt_tokens: int
temperature: float | None = None
top_p: float | None = None
top_k: int | None = None
@dataclass
class ActiveRequest:
"""Tracks an active request in the batch."""
command_id: CommandId
task_id: TaskId
uid: int # BatchGenerator's internal ID
detokenizer: StreamingDetokenizer
tokens_generated: int = 0
prompt_tokens: int = 0
start_time: float = field(default_factory=time.perf_counter)
@dataclass
class BatchedGenerationResponse:
"""Response from batch engine, tagged with command_id and task_id."""
command_id: CommandId
task_id: TaskId
response: GenerationResponse
class BatchGenerationEngine:
"""Manages continuous batching using mlx_lm's BatchGenerator."""
def __init__(
self,
model: Model,
tokenizer: TokenizerWrapper,
group: mx.distributed.Group | None = None,
max_tokens: int = MAX_TOKENS,
completion_batch_size: int = 32,
prefill_batch_size: int = 8,
prefill_step_size: int = 2048,
):
self.model = model
self.tokenizer = tokenizer
self.max_tokens = max_tokens
self.active_requests: dict[int, ActiveRequest] = {}
self._pending_inserts: list[PendingInsert] = []
self._pending_completions: list[
int
] = [] # UIDs completed but not yet synced/removed
self.group = group
self.rank = group.rank() if group else 0
self.is_distributed = group is not None and group.size() > 1
sampler = make_sampler(temp=0.7, top_p=1.0)
eos_tokens: set[int] = set(tokenizer.eos_token_ids or [])
self.batch_gen: BatchGenerator = BatchGenerator(
model=model,
max_tokens=max_tokens,
stop_tokens=eos_tokens,
sampler=sampler,
completion_batch_size=completion_batch_size,
prefill_batch_size=prefill_batch_size,
prefill_step_size=prefill_step_size,
)
logger.info(
f"BatchGenerationEngine initialized with completion_batch_size={completion_batch_size}, "
f"prefill_batch_size={prefill_batch_size}, distributed={self.is_distributed}"
)
def queue_request(
self,
command_id: CommandId,
task_id: TaskId,
task_params: TextGenerationTaskParams,
) -> str:
"""Queue a pre-tokenized request for insertion. Only rank 0 should call this.
Tokenization happens here (eagerly) so that sync_and_insert_pending()
only does the lightweight batch_gen.insert() call, keeping the decode
thread unblocked for as long as possible.
Returns the prompt string for caller use (e.g. thinking-mode detection).
"""
assert self.rank == 0, "Only rank 0 should queue requests"
prompt_str = apply_chat_template(self.tokenizer, task_params)
tokens: list[int] = self.tokenizer.encode(prompt_str, add_special_tokens=False)
max_tokens = task_params.max_output_tokens or self.max_tokens
self._pending_inserts.append(
PendingInsert(
command_id=command_id,
task_id=task_id,
tokens=tokens,
max_tokens=max_tokens,
prompt_tokens=len(tokens),
temperature=task_params.temperature,
top_p=task_params.top_p,
top_k=task_params.top_k,
)
)
logger.info(
f"Queued request {command_id} for insertion (pending={len(self._pending_inserts)}, prompt_tokens={len(tokens)})"
)
return prompt_str
def sync_and_insert_pending(self) -> list[int]:
"""Sync pre-tokenized pending inserts across ranks and insert them. Returns UIDs.
Tokens are already prepared by queue_request(), so this method only does
the lightweight batch_gen.insert() call plus distributed sync if needed.
"""
inserts_to_process: list[PendingInsert]
if not self.is_distributed:
# Non-distributed: just insert directly from pending
inserts_to_process = list(self._pending_inserts)
else:
# Distributed: broadcast pre-tokenized inserts from rank 0 to all ranks
assert self.group is not None
inserts_to_process = share_object(
self._pending_inserts if self.rank == 0 else None,
self.rank,
self.group,
)
if not inserts_to_process:
self._pending_inserts.clear()
return []
# Update sampler from per-request parameters (last request wins for batch)
last = inserts_to_process[-1]
self.batch_gen.sampler = make_sampler( # pyright: ignore[reportAttributeAccessIssue]
temp=last.temperature if last.temperature is not None else 0.7,
top_p=last.top_p if last.top_p is not None else 1.0,
top_k=last.top_k if last.top_k is not None else 0,
)
# Single batched insert for efficient prefill — tokens already prepared
all_tokens = [p.tokens for p in inserts_to_process]
all_max_tokens = [p.max_tokens for p in inserts_to_process]
uids = self.batch_gen.insert(all_tokens, max_tokens=all_max_tokens)
# Track all inserted requests
for i, uid in enumerate(uids):
p = inserts_to_process[i]
self.active_requests[uid] = ActiveRequest(
command_id=p.command_id,
task_id=p.task_id,
uid=uid,
detokenizer=self.tokenizer.detokenizer,
prompt_tokens=p.prompt_tokens,
)
logger.info(
f"Inserted request {p.command_id} with uid={uid}, prompt_tokens={p.prompt_tokens}, max_tokens={p.max_tokens}"
)
self._pending_inserts.clear()
return uids
def step(self) -> list[BatchedGenerationResponse]:
"""Run one decode step. Tracks completions but does not sync - call sync_completions() at budget boundaries."""
responses = self.batch_gen.next()
if not responses:
return []
results: list[BatchedGenerationResponse] = []
for r in responses:
uid: int = r.uid
req = self.active_requests.get(uid)
if req is None:
logger.warning(f"Received response for unknown uid={uid}")
continue
req.tokens_generated += 1
# Decode the token
token: int = r.token
req.detokenizer.add_token(token)
text: str = req.detokenizer.last_segment
stats: GenerationStats | None = None
finish_reason: FinishReason | None = None
raw_finish_reason: str | None = r.finish_reason
if raw_finish_reason is not None:
# Finalize to get remaining text
req.detokenizer.finalize()
text = req.detokenizer.last_segment
elapsed = time.perf_counter() - req.start_time
generation_tps = req.tokens_generated / elapsed if elapsed > 0 else 0.0
stats = GenerationStats(
prompt_tps=0.0, # Not tracked per-request in batch mode
generation_tps=generation_tps,
prompt_tokens=req.prompt_tokens,
generation_tokens=req.tokens_generated,
peak_memory_usage=Memory.from_gb(mx.get_peak_memory() / 1e9),
)
if raw_finish_reason in get_args(FinishReason):
finish_reason = raw_finish_reason # pyright: ignore[reportAssignmentType]
else:
logger.warning(f"Unknown finish_reason: {raw_finish_reason}")
finish_reason = "stop"
# Track completion but don't remove yet - wait for sync_completions()
self._pending_completions.append(uid)
logger.info(
f"Request {req.command_id} completed: {req.tokens_generated} tokens, {generation_tps:.2f} tps, reason={finish_reason}"
)
results.append(
BatchedGenerationResponse(
command_id=req.command_id,
task_id=req.task_id,
response=GenerationResponse(
text=text,
token=token,
finish_reason=finish_reason,
stats=stats,
usage=None,
),
)
)
# In non-distributed mode, clean up completions immediately
if not self.is_distributed:
self._remove_completed()
return results
def sync_completions(self) -> None:
"""Sync and remove completed requests. Call at time budget boundaries in distributed mode."""
if not self.is_distributed:
# Non-distributed: early return if nothing to do
if not self._pending_completions:
return
self._remove_completed()
return
# Distributed mode: ALWAYS sync to ensure all ranks participate in collective op
# This prevents deadlock if one rank has completions and another doesn't
assert self.group is not None
self._pending_completions = share_object(
self._pending_completions if self.rank == 0 else None,
self.rank,
self.group,
)
self._remove_completed()
def _remove_completed(self) -> None:
"""Remove completed requests from tracking."""
for uid in self._pending_completions:
if uid in self.active_requests:
del self.active_requests[uid]
self._pending_completions.clear()
@property
def has_active_requests(self) -> bool:
return bool(self.active_requests or self.batch_gen.unprocessed_prompts)
@property
def has_pending_inserts(self) -> bool:
return bool(self._pending_inserts)
@property
def active_count(self) -> int:
return len(self.active_requests)
@property
def pending_count(self) -> int:
return len(self.batch_gen.unprocessed_prompts)
@property
def pending_insert_count(self) -> int:
return len(self._pending_inserts)
@property
def has_pending_completions(self) -> bool:
return bool(self._pending_completions)
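
A hedged sketch of how rank 0 might drive this engine in a decode loop, tying together queue_request(), sync_and_insert_pending(), step() and sync_completions() with the TimeBudget iterator that appears later in this diff. The names model, tokenizer, group, incoming and send_chunk are placeholders, and the TimeBudget import path is assumed:

from exo.worker.engines.mlx.generator.time_budget import TimeBudget  # assumed module path

engine = BatchGenerationEngine(model, tokenizer, group=group)


def decode_loop(incoming, send_chunk):
    # Tokenization happens at queue time, off the decode hot path
    for command_id, task_id, task_params in incoming:
        engine.queue_request(command_id, task_id, task_params)
    while engine.has_pending_inserts or engine.has_active_requests:
        # One batched prefill for everything queued since the last cycle
        engine.sync_and_insert_pending()
        for _ in TimeBudget(budget=0.5, group=group):
            for out in engine.step():  # one decode step for the whole batch
                send_chunk(out.command_id, out.response)
            if not engine.has_active_requests:
                break
        # Collective removal of finished requests at the budget boundary
        engine.sync_completions()

Note that the sampler is rebuilt from the most recently queued request's parameters on each insert ("last request wins"), so a mixed batch shares one temperature/top_p/top_k setting.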

View File

@@ -0,0 +1,34 @@
"""Distributed sync utilities using mx.distributed.all_sum() to broadcast from rank 0."""
# pyright: reportAny=false
import pickle
from typing import cast
import mlx.core as mx
def share_object[T](obj: T | None, rank: int, group: mx.distributed.Group) -> T:
"""Broadcast object from rank 0 to all ranks. Two-phase: size then data.
Rank 0 must always provide a non-None object. Non-rank-0 callers pass None
(they are receivers only). Use mx_barrier() instead if no data needs to be shared.
"""
if rank == 0:
assert obj is not None, (
"Rank 0 must provide data; use mx_barrier() to sync without data"
)
data = mx.array(list(pickle.dumps(obj)), dtype=mx.uint8)
mx.eval(mx.distributed.all_sum(mx.array([data.size]), group=group))
mx.eval(mx.distributed.all_sum(data, group=group))
return obj
else:
size = int(mx.distributed.all_sum(mx.array([0]), group=group).item())
if size == 0:
raise RuntimeError(
"share_object received size=0 from rank 0 — protocol violation"
)
data = mx.zeros(size, dtype=mx.uint8)
data = mx.distributed.all_sum(data, group=group)
mx.eval(data)
return cast(T, pickle.loads(bytes(cast(list[int], data.tolist()))))
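
An illustrative call pattern for the helper above, assuming an already-initialized MLX distributed group; the payload list stands in for the engine's pending inserts:

import mlx.core as mx

# share_object is the function defined above
group = mx.distributed.init()
payload = ["req-1", "req-2"] if group.rank() == 0 else None
synced = share_object(payload, group.rank(), group)
# Every rank now holds ["req-1", "req-2"]: the byte count is all-summed first so
# receivers can size their buffer, then the pickled bytes are all-summed, and
# because the non-zero ranks contribute zeros the sum acts as a broadcast.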

View File

@@ -306,7 +306,7 @@ def mlx_generate(
max_stop_len = max((len(s) for s in stop_sequences), default=0)
mx_barrier(group)
logger.info("Starting prefill")
logger.info("Ready to prefill")
# Prefill cache with all tokens except the last one
prefill_tps, prefill_tokens, ssm_snapshots_list = prefill(
@@ -393,11 +393,10 @@ def mlx_generate(
f"Model generated unexpected finish_reason: {out.finish_reason}"
)
total_prompt_tokens = len(all_prompt_tokens)
usage = Usage(
prompt_tokens=total_prompt_tokens,
prompt_tokens=int(out.prompt_tokens),
completion_tokens=completion_tokens,
total_tokens=total_prompt_tokens + completion_tokens,
total_tokens=int(out.prompt_tokens) + completion_tokens,
prompt_tokens_details=PromptTokensDetails(
cached_tokens=prefix_hit_length
),

View File

@@ -0,0 +1,104 @@
"""Time budget iterator for controlling generation loop timing in distributed mode.
Based on mlx-lm's TimeBudget pattern - runs for a time budget then syncs,
rather than syncing every token. This reduces distributed sync overhead.
"""
import time
from typing import Iterator
import mlx.core as mx
from exo.worker.runner.bootstrap import logger
generation_stream = mx.new_stream(mx.default_device())
class TimeBudget(Iterator[None]):
"""Controls generation loop timing, syncing across ranks periodically.
In distributed mode, periodically syncs timing across all ranks to
dynamically adjust iteration count based on actual performance.
In non-distributed mode, simply runs for the time budget.
Usage:
for _ in TimeBudget(budget=0.5):
batch_engine.step()
# ... process responses ...
"""
def __init__(
self,
budget: float = 0.5,
iterations: int = 25,
sync_frequency: int = 10,
group: mx.distributed.Group | None = None,
):
"""Initialize TimeBudget.
Args:
budget: Time budget in seconds before yielding control
iterations: Initial number of iterations per budget period (distributed only)
sync_frequency: How often to sync timing across ranks (distributed only)
group: Distributed group, or None for non-distributed mode
"""
self._budget = budget
self._iterations = iterations
self._sync_frequency = sync_frequency
self._group = group
self._is_distributed = group is not None and group.size() > 1
# Runtime state
self._start: float = 0.0
self._current_iterations: int = 0
self._loops: int = 0
self._time_spent: float = 0.0
def __iter__(self) -> "TimeBudget":
self._start = time.perf_counter()
self._current_iterations = 0
return self
def __next__(self) -> None:
if not self._is_distributed:
# Non-distributed: just check time budget
if time.perf_counter() - self._start > self._budget:
raise StopIteration()
return None
# Distributed mode: iteration-based with periodic timing sync
self._current_iterations += 1
if self._current_iterations > self._iterations:
self._loops += 1
self._time_spent += time.perf_counter() - self._start
if self._loops % self._sync_frequency == 0:
# Sync timing across all ranks
assert self._group is not None
with mx.stream(generation_stream):
time_array = mx.array([self._time_spent], dtype=mx.float32)
total_time = mx.distributed.all_sum(time_array, group=self._group)
mx.eval(total_time)
loop_time = float(total_time.item())
avg_loop_time = loop_time / (self._group.size() * self._sync_frequency)
if avg_loop_time > 0:
factor = self._budget / avg_loop_time
self._iterations = max(round(self._iterations * factor), 1)
logger.debug(
f"TimeBudget adjusted iterations to {self._iterations}"
)
self._loops = 0
self._time_spent = 0.0
raise StopIteration()
return None
@property
def iterations(self) -> int:
"""Current iterations per budget period."""
return self._iterations
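
To make the adaptive resizing concrete: with budget=0.5, sync_frequency=10 and two ranks, suppose the all-summed time after ten budget periods is 5.0 s. Then avg_loop_time = 5.0 / (2 * 10) = 0.25 s, factor = 0.5 / 0.25 = 2.0, and the per-period iteration count doubles; a slower cluster drives the same formula below 1 and scales the count back down. This is simply the arithmetic of the code above, not a measured result.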

View File

@@ -64,6 +64,8 @@ from exo.worker.runner.bootstrap import logger
Group = mx.distributed.Group
# TODO: Test this
# ALSO https://github.com/exo-explore/exo/pull/233#discussion_r2549683673
def get_weights_size(model_shard_meta: ShardMetadata) -> Memory:
return Memory.from_float_kb(
(model_shard_meta.end_layer - model_shard_meta.start_layer)
@@ -81,6 +83,30 @@ class ModelLoadingTimeoutError(Exception):
pass
def mx_barrier(group: Group | None = None):
mx.eval(
mx.distributed.all_sum(
mx.array(1.0),
stream=mx.default_stream(mx.Device(mx.cpu)),
group=group,
)
)
def broadcast_from_zero(value: int, group: Group | None = None):
if group is None:
return value
if group.rank() == 0:
a = mx.array([value], dtype=mx.int32)
else:
a = mx.array([0], dtype=mx.int32)
m = mx.distributed.all_sum(a, stream=mx.Device(mx.DeviceType.cpu), group=group)
mx.eval(m)
return int(m.item())
class HostList(RootModel[list[str]]):
@classmethod
def from_hosts(cls, hosts: list[Host]) -> "HostList":
@@ -353,13 +379,7 @@ def load_tokenizer_for_model_id(
return list(hf_tokenizer.model.encode(text, allowed_special="all")) # pyright: ignore[reportUnknownMemberType,reportUnknownArgumentType]
hf_tokenizer.encode = _patched_encode
return TokenizerWrapper(
hf_tokenizer,
eos_token_ids=eos_token_ids,
tool_call_start="<|tool_calls_section_begin|>",
tool_call_end="<|tool_calls_section_end|>",
tool_parser=_parse_kimi_tool_calls,
)
return TokenizerWrapper(hf_tokenizer, eos_token_ids=eos_token_ids)
tokenizer = load_tokenizer(
model_path,
@@ -571,61 +591,3 @@ def mlx_cleanup(
import gc
gc.collect()
def mx_any(bool_: bool, group: Group | None) -> bool:
if group is None:
return bool_
num_true = mx.distributed.all_sum(
mx.array(bool_), group=group, stream=mx.default_stream(mx.Device(mx.cpu))
)
mx.eval(num_true)
return num_true.item() > 0
def mx_barrier(group: Group | None):
if group is None:
return
mx.eval(
mx.distributed.all_sum(
mx.array(1.0), group=group, stream=mx.default_stream(mx.Device(mx.cpu))
)
)
def _parse_kimi_tool_calls(text: str):
import regex as re
# kimi has a fixed function naming scheme, with a json formatted arg
# functions.multiply:0<|tool_call_argument_begin|>{"a": 2, "b": 3}
_func_name_regex = re.compile(
r"^\s*((?:functions\.)?(.+?):\d+)\s*<\|tool_call_argument_begin\|>", re.DOTALL
)
_func_arg_regex = re.compile(r"<\|tool_call_argument_begin\|>\s*(.*)\s*", re.DOTALL)
_tool_call_split_regex = re.compile(
r"<\|tool_call_begin\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)
def _parse_single_tool(text: str) -> dict[str, Any]:
func_name_match = _func_name_regex.search(text)
if func_name_match is None:
raise ValueError("No tool call found.")
tool_call_id = func_name_match.group(1) # e.g. "functions.get_weather:0"
func_name = func_name_match.group(2) # e.g. "get_weather"
func_args_match = _func_arg_regex.search(text)
if func_args_match is None:
raise ValueError("No tool call arguments found.")
func_args = func_args_match.group(1)
try:
arg_dct = json.loads(func_args) # pyright: ignore[reportAny]
except Exception:
arg_dct = None
return dict(id=tool_call_id, name=func_name, arguments=arg_dct)
tool_matches = _tool_call_split_regex.findall(text)
if tool_matches:
return [_parse_single_tool(match) for match in tool_matches] # pyright: ignore[reportAny]
else:
return [_parse_single_tool(text)]

View File

@@ -33,7 +33,6 @@ from exo.shared.types.events import (
from exo.shared.types.multiaddr import Multiaddr
from exo.shared.types.state import State
from exo.shared.types.tasks import (
CancelTask,
CreateRunner,
DownloadModel,
ImageEdits,
@@ -173,123 +172,127 @@ class Worker:
async def plan_step(self):
while True:
await anyio.sleep(0.1)
task: Task | None = plan(
self.node_id,
self.runners,
self.state.downloads,
self.state.instances,
self.state.runners,
self.state.tasks,
self.input_chunk_buffer,
self.input_chunk_counts,
)
if task is None:
continue
# Drain all available tasks before sleeping again.
# This ensures concurrent request arrivals are dispatched
# rapidly rather than one-per-100ms.
while True:
task: Task | None = plan(
self.node_id,
self.runners,
self.state.downloads,
self.state.instances,
self.state.runners,
self.state.tasks,
self.input_chunk_buffer,
self.input_chunk_counts,
)
if task is None:
break
# Gate DownloadModel on backoff BEFORE emitting TaskCreated
# to prevent flooding the event log with useless events
if isinstance(task, DownloadModel):
model_id = task.shard_metadata.model_card.model_id
if not self._download_backoff.should_proceed(model_id):
continue
# Gate DownloadModel on backoff BEFORE emitting TaskCreated
# to prevent flooding the event log with useless events
if isinstance(task, DownloadModel):
model_id = task.shard_metadata.model_card.model_id
if not self._download_backoff.should_proceed(model_id):
break
logger.info(f"Worker plan: {task.__class__.__name__}")
assert task.task_status
await self.event_sender.send(TaskCreated(task_id=task.task_id, task=task))
logger.info(f"Worker plan: {task.__class__.__name__}")
assert task.task_status
await self.event_sender.send(
TaskCreated(task_id=task.task_id, task=task)
)
# lets not kill the worker if a runner is unresponsive
match task:
case CreateRunner():
self._create_supervisor(task)
await self.event_sender.send(
TaskStatusUpdated(
task_id=task.task_id, task_status=TaskStatus.Complete
)
)
case DownloadModel(shard_metadata=shard):
model_id = shard.model_card.model_id
self._download_backoff.record_attempt(model_id)
await self.download_command_sender.send(
ForwarderDownloadCommand(
origin=self.node_id,
command=StartDownload(
target_node_id=self.node_id,
shard_metadata=shard,
),
)
)
await self.event_sender.send(
TaskStatusUpdated(
task_id=task.task_id, task_status=TaskStatus.Running
)
)
case Shutdown(runner_id=runner_id):
runner = self.runners.pop(runner_id)
try:
with fail_after(3):
await runner.start_task(task)
except TimeoutError:
# lets not kill the worker if a runner is unresponsive
match task:
case CreateRunner():
self._create_supervisor(task)
await self.event_sender.send(
TaskStatusUpdated(
task_id=task.task_id, task_status=TaskStatus.TimedOut
task_id=task.task_id,
task_status=TaskStatus.Complete,
)
)
finally:
runner.shutdown()
case CancelTask(
cancelled_task_id=cancelled_task_id, runner_id=runner_id
):
await self.runners[runner_id].cancel_task(cancelled_task_id)
case ImageEdits() if task.task_params.total_input_chunks > 0:
# Assemble image from chunks and inject into task
cmd_id = task.command_id
chunks = self.input_chunk_buffer.get(cmd_id, {})
assembled = "".join(chunks[i] for i in range(len(chunks)))
logger.info(
f"Assembled input image from {len(chunks)} chunks, "
f"total size: {len(assembled)} bytes"
)
# Create modified task with assembled image data
modified_task = ImageEdits(
task_id=task.task_id,
command_id=task.command_id,
instance_id=task.instance_id,
task_status=task.task_status,
task_params=ImageEditsTaskParams(
image_data=assembled,
total_input_chunks=task.task_params.total_input_chunks,
prompt=task.task_params.prompt,
model=task.task_params.model,
n=task.task_params.n,
quality=task.task_params.quality,
output_format=task.task_params.output_format,
response_format=task.task_params.response_format,
size=task.task_params.size,
image_strength=task.task_params.image_strength,
bench=task.task_params.bench,
stream=task.task_params.stream,
partial_images=task.task_params.partial_images,
advanced_params=task.task_params.advanced_params,
),
)
# Cleanup buffers
if cmd_id in self.input_chunk_buffer:
del self.input_chunk_buffer[cmd_id]
if cmd_id in self.input_chunk_counts:
del self.input_chunk_counts[cmd_id]
await self._start_runner_task(modified_task)
case task:
await self._start_runner_task(task)
case DownloadModel(shard_metadata=shard):
model_id = shard.model_card.model_id
self._download_backoff.record_attempt(model_id)
await self.download_command_sender.send(
ForwarderDownloadCommand(
origin=self.node_id,
command=StartDownload(
target_node_id=self.node_id,
shard_metadata=shard,
),
)
)
await self.event_sender.send(
TaskStatusUpdated(
task_id=task.task_id,
task_status=TaskStatus.Running,
)
)
case Shutdown(runner_id=runner_id):
try:
with fail_after(3):
await self.runners.pop(runner_id).start_task(task)
except TimeoutError:
await self.event_sender.send(
TaskStatusUpdated(
task_id=task.task_id,
task_status=TaskStatus.TimedOut,
)
)
case ImageEdits() if task.task_params.total_input_chunks > 0:
# Assemble image from chunks and inject into task
cmd_id = task.command_id
chunks = self.input_chunk_buffer.get(cmd_id, {})
assembled = "".join(chunks[i] for i in range(len(chunks)))
logger.info(
f"Assembled input image from {len(chunks)} chunks, "
f"total size: {len(assembled)} bytes"
)
# Create modified task with assembled image data
modified_task = ImageEdits(
task_id=task.task_id,
command_id=task.command_id,
instance_id=task.instance_id,
task_status=task.task_status,
task_params=ImageEditsTaskParams(
image_data=assembled,
total_input_chunks=task.task_params.total_input_chunks,
prompt=task.task_params.prompt,
model=task.task_params.model,
n=task.task_params.n,
quality=task.task_params.quality,
output_format=task.task_params.output_format,
response_format=task.task_params.response_format,
size=task.task_params.size,
image_strength=task.task_params.image_strength,
bench=task.task_params.bench,
stream=task.task_params.stream,
partial_images=task.task_params.partial_images,
advanced_params=task.task_params.advanced_params,
),
)
# Cleanup buffers
if cmd_id in self.input_chunk_buffer:
del self.input_chunk_buffer[cmd_id]
if cmd_id in self.input_chunk_counts:
del self.input_chunk_counts[cmd_id]
await self.runners[self._task_to_runner_id(task)].start_task(
modified_task
)
case task:
await self.runners[self._task_to_runner_id(task)].start_task(
task
)
def shutdown(self):
self._tg.cancel_scope.cancel()
async def _start_runner_task(self, task: Task):
if (instance := self.state.instances.get(task.instance_id)) is not None:
await self.runners[
instance.shard_assignments.node_to_runner[self.node_id]
].start_task(task)
def _task_to_runner_id(self, task: Task):
instance = self.state.instances[task.instance_id]
return instance.shard_assignments.node_to_runner[self.node_id]
async def _nack_request(self, since_idx: int) -> None:
# We request all events after (and including) the missing index.
@@ -328,6 +331,8 @@ class Worker:
for event in self.out_for_delivery.copy().values():
await self.local_event_sender.send(event)
## Op Executors
def _create_supervisor(self, task: CreateRunner) -> RunnerSupervisor:
"""Creates and stores a new AssignedRunner with initial downloading status."""
runner = RunnerSupervisor.create(

View File

@@ -4,7 +4,6 @@ from collections.abc import Mapping, Sequence
from exo.shared.types.common import CommandId, NodeId
from exo.shared.types.tasks import (
CancelTask,
ConnectToGroup,
CreateRunner,
DownloadModel,
@@ -54,14 +53,13 @@ def plan(
) -> Task | None:
# Python short circuiting OR logic should evaluate these sequentially.
return (
_cancel_tasks(runners, tasks)
or _kill_runner(runners, all_runners, instances)
_kill_runner(runners, all_runners, instances)
or _create_runner(node_id, runners, instances)
or _model_needs_download(node_id, runners, global_download_status)
or _init_distributed_backend(runners, all_runners)
or _load_model(runners, all_runners, global_download_status)
or _ready_to_warmup(runners, all_runners)
or _pending_tasks(runners, tasks, all_runners, input_chunk_buffer or {})
or _pending_tasks(runners, tasks, all_runners, input_chunk_buffer)
)
@@ -272,7 +270,7 @@ def _pending_tasks(
runners: Mapping[RunnerId, RunnerSupervisor],
tasks: Mapping[TaskId, Task],
all_runners: Mapping[RunnerId, RunnerStatus],
input_chunk_buffer: Mapping[CommandId, dict[int, str]],
input_chunk_buffer: Mapping[CommandId, dict[int, str]] | None = None,
) -> Task | None:
for task in tasks.values():
# for now, just forward chat completions
@@ -286,7 +284,7 @@ def _pending_tasks(
if isinstance(task, ImageEdits) and task.task_params.total_input_chunks > 0:
cmd_id = task.command_id
expected = task.task_params.total_input_chunks
received = len(input_chunk_buffer.get(cmd_id, {}))
received = len((input_chunk_buffer or {}).get(cmd_id, {}))
if received < expected:
continue # Wait for all chunks to arrive
@@ -294,33 +292,18 @@ def _pending_tasks(
if task.instance_id != runner.bound_instance.instance.instance_id:
continue
# the task status _should_ be set to completed by the LAST runner
# it is currently set by the first
# this is definitely a hack
if task.task_id in runner.completed:
# Design note: this is a state race in disguise - the task status doesn't get updated to completed fast enough.
# Realistically the task status should be set to completed by the LAST runner, so this is a true race;
# the proper fix goes deeper than this bypass - TODO!
# Also skip tasks in pending to prevent duplicate forwarding with continuous batching
if task.task_id in runner.completed or task.task_id in runner.pending:
continue
if isinstance(runner.status, RunnerReady) and all(
# TODO: Check ordering aligns with MLX distributed's expectations.
# Allow forwarding tasks when runner is Ready or Running (for continuous batching)
if isinstance(runner.status, (RunnerReady, RunnerRunning)) and all(
isinstance(all_runners[global_runner_id], (RunnerReady, RunnerRunning))
for global_runner_id in runner.bound_instance.instance.shard_assignments.runner_to_shard
):
return task
def _cancel_tasks(
runners: Mapping[RunnerId, RunnerSupervisor],
tasks: Mapping[TaskId, Task],
) -> Task | None:
for task in tasks.values():
if task.task_status != TaskStatus.Cancelled:
continue
for runner_id, runner in runners.items():
if task.instance_id != runner.bound_instance.instance.instance_id:
continue
if task.task_id in runner.cancelled:
continue
return CancelTask(
instance_id=task.instance_id,
cancelled_task_id=task.task_id,
runner_id=runner_id,
)
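For readers unfamiliar with the planner pattern in this file: plan() relies on Python's short-circuit `or`, so the first helper that returns a Task wins and the remaining helpers are never evaluated that cycle. The following is an illustrative standalone sketch of that idea only; the helper names and FakeTask type are stand-ins, not the real exo types.

# Illustrative sketch only - standalone Python, not part of the diff above.
from dataclasses import dataclass

@dataclass
class FakeTask:              # stand-in for exo's Task type
    name: str

def _kill_runner() -> FakeTask | None:
    return None              # nothing to kill this cycle

def _create_runner() -> FakeTask | None:
    return FakeTask("create_runner")

def _pending_tasks() -> FakeTask | None:
    return FakeTask("forward_pending")   # never reached below

def plan_sketch() -> FakeTask | None:
    # Each helper returns a Task or None; `or` short-circuits on the
    # first non-None result, so at most one Task is planned per cycle.
    return _kill_runner() or _create_runner() or _pending_tasks()

assert plan_sketch() == FakeTask("create_runner")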


@@ -3,7 +3,7 @@ import os
import loguru
from exo.shared.types.events import Event, RunnerStatusUpdated
from exo.shared.types.tasks import Task, TaskId
from exo.shared.types.tasks import Task
from exo.shared.types.worker.instances import BoundInstance, MlxJacclInstance
from exo.shared.types.worker.runners import RunnerFailed
from exo.utils.channels import ClosedResourceError, MpReceiver, MpSender
@@ -15,7 +15,6 @@ def entrypoint(
bound_instance: BoundInstance,
event_sender: MpSender[Event],
task_receiver: MpReceiver[Task],
cancel_receiver: MpReceiver[TaskId],
_logger: "loguru.Logger",
) -> None:
fast_synch_override = os.environ.get("EXO_FAST_SYNCH")
@@ -39,7 +38,7 @@ def entrypoint(
try:
from exo.worker.runner.runner import main
main(bound_instance, event_sender, task_receiver, cancel_receiver)
main(bound_instance, event_sender, task_receiver)
except ClosedResourceError:
logger.warning("Runner communication closed unexpectedly")
except Exception as e:


File diff suppressed because it is too large.


@@ -47,11 +47,9 @@ class RunnerSupervisor:
_ev_recv: MpReceiver[Event]
_task_sender: MpSender[Task]
_event_sender: Sender[Event]
_cancel_sender: MpSender[TaskId]
status: RunnerStatus = field(default_factory=RunnerIdle, init=False)
pending: dict[TaskId, anyio.Event] = field(default_factory=dict, init=False)
completed: set[TaskId] = field(default_factory=set, init=False)
cancelled: set[TaskId] = field(default_factory=set, init=False)
@classmethod
def create(
@@ -62,8 +60,8 @@ class RunnerSupervisor:
initialize_timeout: float = 400,
) -> Self:
ev_send, ev_recv = mp_channel[Event]()
# A task is kind of a runner command
task_sender, task_recv = mp_channel[Task]()
cancel_sender, cancel_recv = mp_channel[TaskId]()
runner_process = Process(
target=entrypoint,
@@ -71,7 +69,6 @@ class RunnerSupervisor:
bound_instance,
ev_send,
task_recv,
cancel_recv,
logger,
),
daemon=True,
@@ -86,7 +83,6 @@ class RunnerSupervisor:
initialize_timeout=initialize_timeout,
_ev_recv=ev_recv,
_task_sender=task_sender,
_cancel_sender=cancel_sender,
_event_sender=event_sender,
)
@@ -101,8 +97,6 @@ class RunnerSupervisor:
self._ev_recv.close()
self._task_sender.close()
self._event_sender.close()
self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK"))
self._cancel_sender.close()
self.runner_process.join(1)
if not self.runner_process.is_alive():
logger.info("Runner process successfully terminated")
@@ -118,6 +112,14 @@ class RunnerSupervisor:
logger.critical("Runner process didn't respond to SIGTERM, killing")
self.runner_process.kill()
self.runner_process.join(1)
if not self.runner_process.is_alive():
return
logger.critical(
"Runner process didn't respond to SIGKILL. System resources may have leaked"
)
async def start_task(self, task: Task):
if task.task_id in self.pending:
logger.warning(
@@ -139,17 +141,6 @@ class RunnerSupervisor:
return
await event.wait()
async def cancel_task(self, task_id: TaskId):
if task_id in self.completed:
logger.info(f"Unable to cancel {task_id} as it has been completed")
return
self.cancelled.add(task_id)
with anyio.move_on_after(0.5) as scope:
await self._cancel_sender.send_async(task_id)
if scope.cancel_called:
logger.error("RunnerSupervisor cancel pipe blocked")
await self._check_runner(TimeoutError("cancel pipe blocked"))
async def _forward_events(self):
with self._ev_recv as events:
try:
@@ -157,7 +148,11 @@ class RunnerSupervisor:
if isinstance(event, RunnerStatusUpdated):
self.status = event.runner_status
if isinstance(event, TaskAcknowledged):
self.pending.pop(event.task_id).set()
# Signal start_task() to return, but keep the entry
# in self.pending so _pending_tasks won't re-dispatch.
pending_event = self.pending.get(event.task_id)
if pending_event is not None:
pending_event.set()
continue
if (
isinstance(event, TaskStatusUpdated)
@@ -175,6 +170,8 @@ class RunnerSupervisor:
),
)
self.completed.add(event.task_id)
# Clean up from pending now that it's fully complete
self.pending.pop(event.task_id, None)
await self._event_sender.send(event)
except (ClosedResourceError, BrokenResourceError) as e:
await self._check_runner(e)
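The net effect of the pending-set change above: TaskAcknowledged only wakes the start_task() waiter, while the entry stays in self.pending until the task is reported Complete, which is what stops _pending_tasks from re-dispatching it. Below is a rough standalone sketch of that lifecycle using anyio primitives; the simplified dict/set state is illustrative, not the real RunnerSupervisor.

# Rough sketch of the pending-entry lifecycle, assuming anyio is installed.
import anyio

async def demo() -> None:
    pending: dict[str, anyio.Event] = {}
    completed: set[str] = set()

    async def start_task(task_id: str) -> None:
        ev = anyio.Event()
        pending[task_id] = ev
        await ev.wait()                      # returns once the runner ACKs

    async def on_ack(task_id: str) -> None:
        ev = pending.get(task_id)
        if ev is not None:
            ev.set()                         # wake start_task, keep the entry

    async def on_complete(task_id: str) -> None:
        completed.add(task_id)
        pending.pop(task_id, None)           # only now drop it from pending

    async with anyio.create_task_group() as tg:
        tg.start_soon(start_task, "t1")
        await anyio.sleep(0.01)              # let start_task register itself
        await on_ack("t1")                   # start_task returns, "t1" stays pending
        assert "t1" in pending
        await on_complete("t1")
        assert "t1" not in pending and "t1" in completed

anyio.run(demo)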


@@ -1,72 +0,0 @@
import json
from dataclasses import dataclass
from typing import Any, Callable
from exo.shared.types.api import ToolCallItem
@dataclass
class ToolParser:
start_parsing: str
end_parsing: str
parse_tool_calls: Callable[[str], list[ToolCallItem] | None]
def make_mlx_parser(
tool_call_start: str,
tool_call_end: str,
tool_parser: Callable[[str], dict[str, Any] | list[dict[str, Any]]],
) -> ToolParser:
def parse_tool_calls(text: str) -> list[ToolCallItem] | None:
try:
text = text.removeprefix(tool_call_start)
text = text.removesuffix(tool_call_end)
parsed = tool_parser(text)
if isinstance(parsed, list):
return [ToolCallItem.model_validate(_flatten(p)) for p in parsed]
else:
return [ToolCallItem.model_validate(_flatten(parsed))]
except Exception:
return None
return ToolParser(
start_parsing=tool_call_start,
end_parsing=tool_call_end,
parse_tool_calls=parse_tool_calls,
)
# TODO / example code:
def _parse_json_calls(text: str) -> list[ToolCallItem] | None:
try:
text = text.removeprefix("<tool_call>")
text = text.removesuffix("</tool_call>")
top_level = {
k: json.dumps(v) if isinstance(v, (dict, list)) else v
for k, v in json.loads(text).items() # pyright: ignore[reportAny]
}
return [ToolCallItem.model_validate(top_level)]
except Exception:
return None
def _flatten(p: dict[str, Any]) -> dict[str, str]:
return {
k: json.dumps(v) if isinstance(v, (dict, list)) else str(v) # pyright: ignore[reportAny]
for k, v in p.items() # pyright: ignore[reportAny]
}
json_tool_parser = ToolParser(
start_parsing="<tool_call>",
end_parsing="</tool_call>",
parse_tool_calls=_parse_json_calls,
)
def infer_tool_parser(chat_template: str) -> ToolParser | None:
"""Attempt to auto-infer a tool parser from the chat template."""
if "<tool_call>" in chat_template and "tool_call.name" in chat_template:
return json_tool_parser
return None
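For context on what this removed module did (its behavior now appears to be handled by passing the start/end markers and parser callable directly into parse_tool_calls, per the test changes further down): it stripped the <tool_call> markers, parsed the JSON body, and re-encoded nested argument values as JSON strings. A small standalone illustration of that flattening step, independent of exo:

# Standalone illustration of the removed _flatten / _parse_json_calls behaviour.
import json
from typing import Any

def flatten(p: dict[str, Any]) -> dict[str, str]:
    # Nested dicts/lists become JSON strings; everything else is str()-ed.
    return {k: json.dumps(v) if isinstance(v, (dict, list)) else str(v) for k, v in p.items()}

raw = '<tool_call>{"name": "get_weather", "arguments": {"city": "SF"}}</tool_call>'
body = raw.removeprefix("<tool_call>").removesuffix("</tool_call>")
print(flatten(json.loads(body)))
# {'name': 'get_weather', 'arguments': '{"city": "SF"}'}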


@@ -20,6 +20,7 @@ class FakeRunnerSupervisor:
bound_instance: BoundInstance
status: RunnerStatus
completed: set[TaskId] = field(default_factory=set)
pending: dict[TaskId, object] = field(default_factory=dict)
class OtherTask(BaseTask):


@@ -0,0 +1,388 @@
"""
Tests for continuous batching behavior in the runner.
These tests verify that:
1. Single requests work through the batch path
2. Multiple concurrent requests batch together
3. Tokens are routed to the correct requests
4. Requests complete at different times appropriately
NOTE: These tests require the continuous-batching runner architecture
(BatchGenerationEngine) which is not yet integrated with main.
"""
# ruff: noqa: E402
# pyright: reportAny=false
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportAttributeAccessIssue=false
# pyright: reportInvalidTypeVarUse=false
from typing import Any
import pytest
import exo.worker.runner.runner as mlx_runner
from exo.shared.types.chunks import TokenChunk
from exo.shared.types.common import CommandId, NodeId
from exo.shared.types.events import (
ChunkGenerated,
Event,
RunnerStatusUpdated,
TaskStatusUpdated,
)
from exo.shared.types.tasks import (
ConnectToGroup,
LoadModel,
Shutdown,
StartWarmup,
Task,
TaskId,
TaskStatus,
TextGeneration,
)
from exo.shared.types.text_generation import InputMessage, TextGenerationTaskParams
from exo.shared.types.worker.runner_response import GenerationResponse
from exo.shared.types.worker.runners import RunnerRunning
from exo.utils.channels import mp_channel
from exo.worker.engines.mlx.generator.batch_engine import (
BatchedGenerationResponse,
)
from exo.worker.tests.constants import (
INSTANCE_1_ID,
MODEL_A_ID,
NODE_A,
RUNNER_1_ID,
)
from exo.worker.tests.unittests.conftest import get_bound_mlx_ring_instance
class FakeBatchEngineWithTokens:
"""
Fake batch engine that generates a specified number of tokens per request.
This simulates realistic batch generation behavior where:
- Requests are queued on insert
- Each step() call generates one token for all active requests
- Requests complete when they've generated all their tokens
"""
def __init__(self, *_args: Any, **_kwargs: Any):
self._active_requests: dict[int, tuple[CommandId, TaskId, int, int]] = {}
self._pending_inserts: list[
tuple[CommandId, TaskId, TextGenerationTaskParams]
] = []
self._uid_counter = 0
self._tokens_per_request = 3 # Default: generate 3 tokens before completing
self.rank = 0 # Fake rank for testing
def queue_request(
self,
command_id: CommandId,
task_id: TaskId,
task_params: TextGenerationTaskParams,
) -> str:
"""Queue a request for insertion."""
self._pending_inserts.append((command_id, task_id, task_params))
return ""
def sync_and_insert_pending(self) -> list[int]:
"""Insert all pending requests."""
uids: list[int] = []
for command_id, task_id, task_params in self._pending_inserts:
uid = self._do_insert(command_id, task_id, task_params)
uids.append(uid)
self._pending_inserts.clear()
return uids
@property
def has_pending_inserts(self) -> bool:
return len(self._pending_inserts) > 0
def _do_insert(
self,
command_id: CommandId,
task_id: TaskId,
task_params: TextGenerationTaskParams | None,
) -> int:
uid = self._uid_counter
self._uid_counter += 1
# Track: (command_id, task_id, tokens_generated, max_tokens)
max_tokens = (
task_params.max_output_tokens if task_params else self._tokens_per_request
)
self._active_requests[uid] = (command_id, task_id, 0, max_tokens or 3)
return uid
def step(self) -> list[BatchedGenerationResponse]:
results: list[BatchedGenerationResponse] = []
uids_to_remove: list[int] = []
for uid, (command_id, task_id, tokens_gen, max_tokens) in list(
self._active_requests.items()
):
tokens_gen += 1
finish_reason = "stop" if tokens_gen >= max_tokens else None
text = f"token{tokens_gen}"
if finish_reason:
uids_to_remove.append(uid)
else:
self._active_requests[uid] = (
command_id,
task_id,
tokens_gen,
max_tokens,
)
results.append(
BatchedGenerationResponse(
command_id=command_id,
task_id=task_id,
response=GenerationResponse(
token=tokens_gen,
text=text,
finish_reason=finish_reason,
usage=None,
),
)
)
for uid in uids_to_remove:
del self._active_requests[uid]
return results
@property
def has_active_requests(self) -> bool:
return len(self._active_requests) > 0
@property
def active_count(self) -> int:
return len(self._active_requests)
@property
def pending_insert_count(self) -> int:
return len(self._pending_inserts)
def sync_completions(self) -> None:
pass # Completions already removed in step()
@property
def is_distributed(self) -> bool:
return False # Non-distributed mode for testing
class MockTokenizer:
"""Mock tokenizer with tool calling disabled."""
tool_parser = None
tool_call_start = None
tool_call_end = None
has_tool_calling = False
has_thinking = False
class FakeGroup:
"""Fake MLX distributed group for testing."""
def rank(self) -> int:
return 0
def size(self) -> int:
return 1 # Single node (non-distributed)
def make_nothin[T, U, V](res: T):
def nothin(*_1: U, **_2: V) -> T:
return res
return nothin
@pytest.fixture
def patch_batch_engine(monkeypatch: pytest.MonkeyPatch):
"""Patch MLX dependencies and use FakeBatchEngineWithTokens."""
monkeypatch.setattr(mlx_runner, "initialize_mlx", make_nothin(FakeGroup()))
monkeypatch.setattr(mlx_runner, "load_mlx_items", make_nothin((1, MockTokenizer)))
monkeypatch.setattr(mlx_runner, "warmup_inference", make_nothin(1))
monkeypatch.setattr(mlx_runner, "_check_for_debug_prompts", make_nothin(None))
monkeypatch.setattr(mlx_runner, "BatchGenerationEngine", FakeBatchEngineWithTokens)
class EventCollector:
"""Collects events directly into a list to avoid mp_channel flakiness."""
def __init__(self) -> None:
self.events: list[Event] = []
def send(self, event: Event) -> None:
self.events.append(event)
def close(self) -> None:
pass
def join(self) -> None:
pass
def _run_with_tasks(tasks: list[Task]) -> list[Event]:
"""
Run tasks through the runner; a Shutdown task is appended and sent last.
Tasks are sent in order, and the batch engine's generation loop runs
between task handling.
"""
bound_instance = get_bound_mlx_ring_instance(
instance_id=INSTANCE_1_ID,
model_id=MODEL_A_ID,
runner_id=RUNNER_1_ID,
node_id=NodeId(NODE_A),
)
task_sender, task_receiver = mp_channel[Task]()
event_collector = EventCollector()
shutdown_task = Shutdown(
task_id=TaskId("shutdown"),
instance_id=INSTANCE_1_ID,
runner_id=RUNNER_1_ID,
)
with task_sender:
# Send all tasks including shutdown
for t in tasks:
task_sender.send(t)
task_sender.send(shutdown_task)
# Disable cleanup methods to prevent issues
task_receiver.close = lambda: None
task_receiver.join = lambda: None
mlx_runner.main(bound_instance, event_collector, task_receiver) # type: ignore[arg-type]
return event_collector.events
INIT_TASK = ConnectToGroup(task_id=TaskId("init"), instance_id=INSTANCE_1_ID)
LOAD_TASK = LoadModel(task_id=TaskId("load"), instance_id=INSTANCE_1_ID)
WARMUP_TASK = StartWarmup(task_id=TaskId("warmup"), instance_id=INSTANCE_1_ID)
def make_chat_task(
task_id: str, command_id: str, max_tokens: int = 3
) -> TextGeneration:
return TextGeneration(
task_id=TaskId(task_id),
command_id=CommandId(command_id),
task_params=TextGenerationTaskParams(
model=MODEL_A_ID,
input=[InputMessage(role="user", content="hello")],
stream=True,
max_output_tokens=max_tokens,
),
instance_id=INSTANCE_1_ID,
)
def test_single_request_generates_tokens(patch_batch_engine: None):
"""
Verify a single request generates the expected tokens through the batch path.
Tokens are generated during the generation loop (not during shutdown drain).
The task completes after all tokens are generated.
"""
chat_task = make_chat_task("chat1", "cmd1", max_tokens=3)
events = _run_with_tasks([INIT_TASK, LOAD_TASK, WARMUP_TASK, chat_task])
# Verify ChunkGenerated events are emitted for all tokens
chunk_events = [
e
for e in events
if isinstance(e, ChunkGenerated) and e.command_id == CommandId("cmd1")
]
assert len(chunk_events) == 3, (
f"Expected 3 ChunkGenerated events, got {len(chunk_events)}"
)
# Last chunk should have finish_reason="stop"
last_chunk = chunk_events[-1].chunk
assert isinstance(last_chunk, TokenChunk)
assert last_chunk.finish_reason == "stop"
# Task should be marked complete after tokens are generated
chat_complete = [
e
for e in events
if isinstance(e, TaskStatusUpdated)
and e.task_id == TaskId("chat1")
and e.task_status == TaskStatus.Complete
]
assert len(chat_complete) == 1, "Expected exactly one chat task Complete status"
def test_runner_status_reflects_active_requests(patch_batch_engine: None):
"""Verify RunnerRunning status includes active_requests count."""
chat_task = make_chat_task("chat1", "cmd1", max_tokens=2)
events = _run_with_tasks([INIT_TASK, LOAD_TASK, WARMUP_TASK, chat_task])
# Find RunnerRunning status events
running_events = [
e
for e in events
if isinstance(e, RunnerStatusUpdated)
and isinstance(e.runner_status, RunnerRunning)
]
assert len(running_events) > 0, "Expected at least one RunnerRunning event"
assert running_events[0].runner_status.active_requests == 1
def test_chat_task_acknowledged(patch_batch_engine: None):
"""Verify chat completion task is acknowledged with proper status updates."""
chat_task = make_chat_task("chat1", "cmd1", max_tokens=2)
events = _run_with_tasks([INIT_TASK, LOAD_TASK, WARMUP_TASK, chat_task])
# Find the chat task status events
chat_running = [
e
for e in events
if isinstance(e, TaskStatusUpdated)
and e.task_id == TaskId("chat1")
and e.task_status == TaskStatus.Running
]
assert len(chat_running) == 1, "Expected exactly one chat task Running status"
def test_multiple_requests_generate_tokens(patch_batch_engine: None):
"""Verify multiple requests each generate their expected tokens."""
chat1 = make_chat_task("chat1", "cmd1", max_tokens=2)
chat2 = make_chat_task("chat2", "cmd2", max_tokens=2)
events = _run_with_tasks([INIT_TASK, LOAD_TASK, WARMUP_TASK, chat1, chat2])
# Both requests should generate their expected number of tokens
cmd1_chunks = [
e
for e in events
if isinstance(e, ChunkGenerated) and e.command_id == CommandId("cmd1")
]
cmd2_chunks = [
e
for e in events
if isinstance(e, ChunkGenerated) and e.command_id == CommandId("cmd2")
]
assert len(cmd1_chunks) == 2, f"Expected 2 chunks for cmd1, got {len(cmd1_chunks)}"
assert len(cmd2_chunks) == 2, f"Expected 2 chunks for cmd2, got {len(cmd2_chunks)}"
# Both tasks should be completed
completed_task_ids = {
e.task_id
for e in events
if isinstance(e, TaskStatusUpdated)
and e.task_status == TaskStatus.Complete
and e.task_id in (TaskId("chat1"), TaskId("chat2"))
}
assert TaskId("chat1") in completed_task_ids
assert TaskId("chat2") in completed_task_ids


@@ -0,0 +1,719 @@
"""
Edge-case tests for continuous batching in the runner.
Tests cover:
1. Concurrent requests with overlapping tool calls
2. Requests that finish mid-generation with 'length' reason
3. Multiple requests finishing on the same step() call
4. Batch of 5+ simultaneous completions
"""
# ruff: noqa: E402
# pyright: reportAny=false
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportAttributeAccessIssue=false
# pyright: reportInvalidTypeVarUse=false
# pyright: reportPrivateUsage=false
import json
from typing import Any
from unittest.mock import MagicMock
import pytest
import exo.worker.runner.runner as mlx_runner
from exo.shared.types.api import FinishReason
from exo.shared.types.chunks import TokenChunk, ToolCallChunk
from exo.shared.types.common import CommandId, NodeId
from exo.shared.types.events import (
ChunkGenerated,
Event,
RunnerStatusUpdated,
TaskStatusUpdated,
)
from exo.shared.types.tasks import (
ConnectToGroup,
LoadModel,
Shutdown,
StartWarmup,
Task,
TaskId,
TaskStatus,
TextGeneration,
)
from exo.shared.types.text_generation import InputMessage, TextGenerationTaskParams
from exo.shared.types.worker.runner_response import GenerationResponse
from exo.shared.types.worker.runners import RunnerReady
from exo.utils.channels import mp_channel
from exo.worker.engines.mlx.generator.batch_engine import (
BatchedGenerationResponse,
)
from exo.worker.tests.constants import (
INSTANCE_1_ID,
MODEL_A_ID,
NODE_A,
RUNNER_1_ID,
)
from exo.worker.tests.unittests.conftest import get_bound_mlx_ring_instance
# ---------------------------------------------------------------------------
# Fake batch engines
# ---------------------------------------------------------------------------
class ScriptedBatchEngine:
"""Batch engine driven by scripted per-request token sequences.
Each request produces a predefined list of (text, finish_reason) pairs.
One step() call pops one token per active request.
"""
def __init__(self, *_args: Any, **_kwargs: Any):
self._active: dict[
int, tuple[CommandId, TaskId, list[tuple[str, FinishReason | None]]]
] = {}
self._pending: list[tuple[CommandId, TaskId, TextGenerationTaskParams]] = []
self._uid = 0
self.rank = 0
# map command_id -> scripted tokens, set externally before tasks arrive
self.scripts: dict[str, list[tuple[str, FinishReason | None]]] = {}
def queue_request(
self,
command_id: CommandId,
task_id: TaskId,
task_params: TextGenerationTaskParams,
) -> str:
self._pending.append((command_id, task_id, task_params))
return ""
def sync_and_insert_pending(self) -> list[int]:
uids: list[int] = []
for cmd_id, task_id, _params in self._pending:
uid = self._uid
self._uid += 1
script = list(self.scripts.get(str(cmd_id), [("tok", "stop")]))
self._active[uid] = (cmd_id, task_id, script)
uids.append(uid)
self._pending.clear()
return uids
@property
def has_pending_inserts(self) -> bool:
return bool(self._pending)
@property
def pending_insert_count(self) -> int:
return len(self._pending)
def step(self) -> list[BatchedGenerationResponse]:
results: list[BatchedGenerationResponse] = []
done: list[int] = []
for uid, (cmd_id, task_id, script) in self._active.items():
if not script:
continue
text, finish_reason = script.pop(0)
results.append(
BatchedGenerationResponse(
command_id=cmd_id,
task_id=task_id,
response=GenerationResponse(
token=0, text=text, finish_reason=finish_reason, usage=None
),
)
)
if finish_reason is not None:
done.append(uid)
for uid in done:
del self._active[uid]
return results
@property
def has_active_requests(self) -> bool:
return bool(self._active)
@property
def active_count(self) -> int:
return len(self._active)
def sync_completions(self) -> None:
pass
@property
def is_distributed(self) -> bool:
return False
class FakeBatchEngineWithTokens:
"""Generates N tokens per request (reused from the main test file)."""
def __init__(self, *_args: Any, **_kwargs: Any):
self._active_requests: dict[int, tuple[CommandId, TaskId, int, int]] = {}
self._pending_inserts: list[
tuple[CommandId, TaskId, TextGenerationTaskParams]
] = []
self._uid_counter = 0
self.rank = 0
def queue_request(
self,
command_id: CommandId,
task_id: TaskId,
task_params: TextGenerationTaskParams,
) -> str:
self._pending_inserts.append((command_id, task_id, task_params))
return ""
def sync_and_insert_pending(self) -> list[int]:
uids: list[int] = []
for command_id, task_id, task_params in self._pending_inserts:
uid = self._uid_counter
self._uid_counter += 1
max_tokens = task_params.max_output_tokens or 3
self._active_requests[uid] = (command_id, task_id, 0, max_tokens)
uids.append(uid)
self._pending_inserts.clear()
return uids
@property
def has_pending_inserts(self) -> bool:
return bool(self._pending_inserts)
@property
def pending_insert_count(self) -> int:
return len(self._pending_inserts)
def step(self) -> list[BatchedGenerationResponse]:
results: list[BatchedGenerationResponse] = []
done: list[int] = []
for uid, (cmd_id, task_id, tokens_gen, max_tokens) in list(
self._active_requests.items()
):
tokens_gen += 1
finish = "stop" if tokens_gen >= max_tokens else None
results.append(
BatchedGenerationResponse(
command_id=cmd_id,
task_id=task_id,
response=GenerationResponse(
token=tokens_gen,
text=f"token{tokens_gen}",
finish_reason=finish,
usage=None,
),
)
)
if finish:
done.append(uid)
else:
self._active_requests[uid] = (cmd_id, task_id, tokens_gen, max_tokens)
for uid in done:
del self._active_requests[uid]
return results
@property
def has_active_requests(self) -> bool:
return bool(self._active_requests)
@property
def active_count(self) -> int:
return len(self._active_requests)
def sync_completions(self) -> None:
pass
@property
def is_distributed(self) -> bool:
return False
# ---------------------------------------------------------------------------
# Mock tokenizers
# ---------------------------------------------------------------------------
class MockTokenizer:
tool_parser = None
tool_call_start = None
tool_call_end = None
has_tool_calling = False
has_thinking = False
class MockToolTokenizer:
"""Tokenizer with tool calling enabled for testing."""
has_tool_calling = True
has_thinking = False
tool_call_start = "<tool>"
tool_call_end = "</tool>"
@staticmethod
def _tool_parser(text: str) -> dict[str, Any]:
return json.loads(text)
class FakeGroup:
def rank(self) -> int:
return 0
def size(self) -> int:
return 1
# ---------------------------------------------------------------------------
# Event collector & runner helper
# ---------------------------------------------------------------------------
class EventCollector:
def __init__(self) -> None:
self.events: list[Event] = []
def send(self, event: Event) -> None:
self.events.append(event)
def close(self) -> None:
pass
def join(self) -> None:
pass
def make_nothin[T, U, V](res: T):
def nothin(*_1: U, **_2: V) -> T:
return res
return nothin
INIT_TASK = ConnectToGroup(task_id=TaskId("init"), instance_id=INSTANCE_1_ID)
LOAD_TASK = LoadModel(task_id=TaskId("load"), instance_id=INSTANCE_1_ID)
WARMUP_TASK = StartWarmup(task_id=TaskId("warmup"), instance_id=INSTANCE_1_ID)
SETUP_TASKS: list[Task] = [INIT_TASK, LOAD_TASK, WARMUP_TASK]
def make_chat_task(
task_id: str, command_id: str, max_tokens: int = 3
) -> TextGeneration:
return TextGeneration(
task_id=TaskId(task_id),
command_id=CommandId(command_id),
task_params=TextGenerationTaskParams(
model=MODEL_A_ID,
input=[InputMessage(role="user", content="hello")],
stream=True,
max_output_tokens=max_tokens,
),
instance_id=INSTANCE_1_ID,
)
def _run_with_tasks(
tasks: list[Task],
engine_cls: type = FakeBatchEngineWithTokens,
tokenizer_cls: type = MockTokenizer,
engine_instance: Any | None = None,
) -> list[Event]:
"""Run tasks through the runner with configurable engine and tokenizer."""
bound = get_bound_mlx_ring_instance(
instance_id=INSTANCE_1_ID,
model_id=MODEL_A_ID,
runner_id=RUNNER_1_ID,
node_id=NodeId(NODE_A),
)
task_sender, task_receiver = mp_channel[Task]()
collector = EventCollector()
shutdown = Shutdown(
task_id=TaskId("shutdown"),
instance_id=INSTANCE_1_ID,
runner_id=RUNNER_1_ID,
)
import exo.worker.runner.runner as r
orig_init_mlx = r.initialize_mlx
orig_load = r.load_mlx_items
orig_warmup = r.warmup_inference
orig_check = r._check_for_debug_prompts
orig_engine = r.BatchGenerationEngine
r.initialize_mlx = make_nothin(FakeGroup())
r.load_mlx_items = make_nothin((MagicMock(), tokenizer_cls))
r.warmup_inference = make_nothin(1)
r._check_for_debug_prompts = make_nothin(None)
if engine_instance is not None:
r.BatchGenerationEngine = lambda *_a, **_kw: engine_instance # pyright: ignore[reportUnknownLambdaType]
else:
r.BatchGenerationEngine = engine_cls
try:
with task_sender:
for t in tasks:
task_sender.send(t)
task_sender.send(shutdown)
task_receiver.close = lambda: None
task_receiver.join = lambda: None
r.main(bound, collector, task_receiver) # pyright: ignore[reportArgumentType]
finally:
r.initialize_mlx = orig_init_mlx
r.load_mlx_items = orig_load
r.warmup_inference = orig_warmup
r._check_for_debug_prompts = orig_check
r.BatchGenerationEngine = orig_engine
return collector.events
# ---------------------------------------------------------------------------
# Helpers for querying events
# ---------------------------------------------------------------------------
def chunks_for(events: list[Event], command_id: str) -> list[ChunkGenerated]:
return [
e
for e in events
if isinstance(e, ChunkGenerated) and e.command_id == CommandId(command_id)
]
def completed_task_ids(events: list[Event]) -> set[TaskId]:
return {
e.task_id
for e in events
if isinstance(e, TaskStatusUpdated) and e.task_status == TaskStatus.Complete
}
# ===========================================================================
# Test 1: Concurrent requests with overlapping tool calls
# ===========================================================================
def test_concurrent_tool_calls_and_normal_text():
"""Two concurrent requests: one emits normal text, the other a tool call.
Verifies that:
- The normal request produces TokenChunks with its text
- The tool-call request produces a ToolCallChunk
- Both tasks complete
"""
engine = ScriptedBatchEngine()
# cmd_normal: 2 normal tokens then stop
engine.scripts["cmd_normal"] = [
("hello", None),
(" world", "stop"),
]
# cmd_tool: tool_start, body, tool_end (suppressed), then finish
engine.scripts["cmd_tool"] = [
("<tool>", None), # swallowed by tracker
('{"name":"get_weather","arguments":{"city":"SF"}}', None), # accumulated
("</tool>", None), # triggers ToolCallChunk emission
("done", "stop"), # normal trailing token
]
chat_normal = make_chat_task("t_normal", "cmd_normal", max_tokens=100)
chat_tool = make_chat_task("t_tool", "cmd_tool", max_tokens=100)
events = _run_with_tasks(
[*SETUP_TASKS, chat_normal, chat_tool],
tokenizer_cls=MockToolTokenizer,
engine_instance=engine,
)
# Normal request: all chunks should be TokenChunk
normal_chunks = chunks_for(events, "cmd_normal")
assert len(normal_chunks) == 2
assert all(isinstance(c.chunk, TokenChunk) for c in normal_chunks)
assert normal_chunks[-1].chunk.finish_reason == "stop"
# Tool-call request
tool_chunks = chunks_for(events, "cmd_tool")
# <tool> → swallowed, body → accumulated, </tool> → ToolCallChunk, "done" → TokenChunk
tool_call_events = [c for c in tool_chunks if isinstance(c.chunk, ToolCallChunk)]
token_events = [c for c in tool_chunks if isinstance(c.chunk, TokenChunk)]
assert len(tool_call_events) == 1, (
f"Expected 1 ToolCallChunk, got {len(tool_call_events)}"
)
tc_chunk = tool_call_events[0].chunk
assert isinstance(tc_chunk, ToolCallChunk)
assert tc_chunk.tool_calls[0].name == "get_weather"
assert json.loads(tc_chunk.tool_calls[0].arguments) == {"city": "SF"}
assert len(token_events) == 1, "Expected 1 trailing TokenChunk after tool call"
assert token_events[0].chunk.finish_reason == "stop"
# Both tasks should complete
done = completed_task_ids(events)
assert TaskId("t_normal") in done
assert TaskId("t_tool") in done
def test_tool_call_interrupted_by_finish_reason():
"""Tool call in progress when finish_reason fires — partial text emitted."""
engine = ScriptedBatchEngine()
engine.scripts["cmd1"] = [
("<tool>", None),
('{"name":"f"', "stop"), # finish while inside tool call
]
chat = make_chat_task("t1", "cmd1", max_tokens=100)
events = _run_with_tasks(
[*SETUP_TASKS, chat],
tokenizer_cls=MockToolTokenizer,
engine_instance=engine,
)
chunks = chunks_for(events, "cmd1")
assert len(chunks) == 1
chunk = chunks[0].chunk
assert isinstance(chunk, TokenChunk)
# The interrupted tool call should be emitted as raw text
assert "<tool>" in chunk.text
assert '{"name":"f"' in chunk.text
assert chunk.finish_reason == "stop"
assert TaskId("t1") in completed_task_ids(events)
# ===========================================================================
# Test 2: Request finishing with 'length' reason (timeout mid-generation)
# ===========================================================================
def test_request_finishes_with_length_reason():
"""Request that hits max_tokens limit and finishes with 'length'."""
engine = ScriptedBatchEngine()
engine.scripts["cmd1"] = [
("tok1", None),
("tok2", None),
("tok3", "length"), # hit the token limit
]
chat = make_chat_task("t1", "cmd1", max_tokens=100)
events = _run_with_tasks(
[*SETUP_TASKS, chat],
engine_instance=engine,
)
chunks = chunks_for(events, "cmd1")
assert len(chunks) == 3
# Last chunk should have finish_reason="length"
assert isinstance(chunks[-1].chunk, TokenChunk)
assert chunks[-1].chunk.finish_reason == "length"
# Earlier chunks should have no finish_reason
for c in chunks[:-1]:
assert isinstance(c.chunk, TokenChunk)
assert c.chunk.finish_reason is None
assert TaskId("t1") in completed_task_ids(events)
def test_mixed_finish_reasons_across_requests():
"""Two requests finishing with different reasons: 'stop' and 'length'."""
engine = ScriptedBatchEngine()
engine.scripts["cmd_stop"] = [("a", None), ("b", "stop")]
engine.scripts["cmd_len"] = [("x", None), ("y", "length")]
chat1 = make_chat_task("t_stop", "cmd_stop", max_tokens=100)
chat2 = make_chat_task("t_len", "cmd_len", max_tokens=100)
events = _run_with_tasks(
[*SETUP_TASKS, chat1, chat2],
engine_instance=engine,
)
stop_chunks = chunks_for(events, "cmd_stop")
len_chunks = chunks_for(events, "cmd_len")
assert stop_chunks[-1].chunk.finish_reason == "stop"
assert len_chunks[-1].chunk.finish_reason == "length"
done = completed_task_ids(events)
assert TaskId("t_stop") in done
assert TaskId("t_len") in done
# ===========================================================================
# Test 3: Multiple finish reasons in rapid succession (same step)
# ===========================================================================
def test_all_requests_finish_on_same_step():
"""Three requests that all finish on the same step() call.
This tests that the runner and _process_generation_results correctly
handle multiple completions in a single step.
"""
engine = ScriptedBatchEngine()
# All three produce exactly 1 token and finish
engine.scripts["cmd_a"] = [("alpha", "stop")]
engine.scripts["cmd_b"] = [("beta", "stop")]
engine.scripts["cmd_c"] = [("gamma", "stop")]
tasks = [
*SETUP_TASKS,
make_chat_task("ta", "cmd_a", max_tokens=100),
make_chat_task("tb", "cmd_b", max_tokens=100),
make_chat_task("tc", "cmd_c", max_tokens=100),
]
events = _run_with_tasks([*tasks], engine_instance=engine)
for cmd_id, expected_text in [
("cmd_a", "alpha"),
("cmd_b", "beta"),
("cmd_c", "gamma"),
]:
c = chunks_for(events, cmd_id)
assert len(c) == 1, f"Expected 1 chunk for {cmd_id}, got {len(c)}"
assert isinstance(c[0].chunk, TokenChunk)
assert c[0].chunk.text == expected_text
assert c[0].chunk.finish_reason == "stop"
done = completed_task_ids(events)
assert TaskId("ta") in done
assert TaskId("tb") in done
assert TaskId("tc") in done
# The runner should reach RunnerReady at least once, after warmup.
# With inline task processing, later requests may be inserted into the
# batch before the generation loop exits, so the runner can stay
# RunnerRunning until Shutdown without an intermediate RunnerReady.
ready_events = [
e
for e in events
if isinstance(e, RunnerStatusUpdated)
and isinstance(e.runner_status, RunnerReady)
]
assert len(ready_events) >= 1, "Expected RunnerReady at least after warmup"
def test_staggered_completions_in_batch():
"""Four requests with different token counts — they complete at different steps.
Verifies each request gets the right number of chunks and the runner
tracks active_requests correctly as requests drain.
"""
engine = ScriptedBatchEngine()
engine.scripts["c1"] = [("a", "stop")] # finishes step 1
engine.scripts["c2"] = [("a", None), ("b", "stop")] # finishes step 2
engine.scripts["c3"] = [("a", None), ("b", None), ("c", "stop")] # finishes step 3
engine.scripts["c4"] = [
("a", None),
("b", None),
("c", None),
("d", "stop"),
] # finishes step 4
tasks = [
*SETUP_TASKS,
make_chat_task("t1", "c1", max_tokens=100),
make_chat_task("t2", "c2", max_tokens=100),
make_chat_task("t3", "c3", max_tokens=100),
make_chat_task("t4", "c4", max_tokens=100),
]
events = _run_with_tasks([*tasks], engine_instance=engine)
assert len(chunks_for(events, "c1")) == 1
assert len(chunks_for(events, "c2")) == 2
assert len(chunks_for(events, "c3")) == 3
assert len(chunks_for(events, "c4")) == 4
done = completed_task_ids(events)
for tid in ["t1", "t2", "t3", "t4"]:
assert TaskId(tid) in done, f"Task {tid} should be complete"
# ===========================================================================
# Test 4: Batch of 5+ simultaneous completions
# ===========================================================================
@pytest.fixture
def patch_batch_engine(monkeypatch: pytest.MonkeyPatch):
monkeypatch.setattr(mlx_runner, "initialize_mlx", make_nothin(FakeGroup()))
monkeypatch.setattr(
mlx_runner, "load_mlx_items", make_nothin((MagicMock(), MockTokenizer))
)
monkeypatch.setattr(mlx_runner, "warmup_inference", make_nothin(1))
monkeypatch.setattr(mlx_runner, "_check_for_debug_prompts", make_nothin(None))
monkeypatch.setattr(mlx_runner, "BatchGenerationEngine", FakeBatchEngineWithTokens)
def test_five_simultaneous_completions(patch_batch_engine: None):
"""Five requests submitted together, all generating tokens and completing."""
chats = [make_chat_task(f"t{i}", f"cmd{i}", max_tokens=2) for i in range(5)]
events = _run_with_tasks([*SETUP_TASKS, *chats])
for i in range(5):
c = chunks_for(events, f"cmd{i}")
assert len(c) == 2, f"Expected 2 chunks for cmd{i}, got {len(c)}"
assert c[-1].chunk.finish_reason == "stop"
done = completed_task_ids(events)
for i in range(5):
assert TaskId(f"t{i}") in done
def test_eight_requests_staggered(patch_batch_engine: None):
"""Eight requests with varying token counts, verifying all complete correctly."""
chats = [make_chat_task(f"t{i}", f"cmd{i}", max_tokens=i + 1) for i in range(8)]
events = _run_with_tasks([*SETUP_TASKS, *chats])
for i in range(8):
c = chunks_for(events, f"cmd{i}")
expected = i + 1
assert len(c) == expected, (
f"Expected {expected} chunks for cmd{i}, got {len(c)}"
)
assert c[-1].chunk.finish_reason == "stop"
done = completed_task_ids(events)
for i in range(8):
assert TaskId(f"t{i}") in done
# Verify runner transitions back to ready after all requests complete
# Find the last RunnerReady before shutdown
ready_events = [
(idx, e)
for idx, e in enumerate(events)
if isinstance(e, RunnerStatusUpdated)
and isinstance(e.runner_status, RunnerReady)
]
shutdown_idx = next(
idx
for idx, e in enumerate(events)
if isinstance(e, TaskStatusUpdated)
and e.task_id == TaskId("shutdown")
and e.task_status == TaskStatus.Running
)
# There should be a RunnerReady event between generation and shutdown
ready_before_shutdown = [idx for idx, _ in ready_events if idx < shutdown_idx]
assert len(ready_before_shutdown) >= 1, (
"Expected RunnerReady between generation completion and shutdown"
)
def test_ten_simultaneous_single_token():
"""Ten requests that each produce exactly one token — all finish on step 1."""
engine = ScriptedBatchEngine()
for i in range(10):
engine.scripts[f"cmd{i}"] = [(f"word{i}", "stop")]
chats = [make_chat_task(f"t{i}", f"cmd{i}", max_tokens=100) for i in range(10)]
events = _run_with_tasks([*SETUP_TASKS, *chats], engine_instance=engine)
for i in range(10):
c = chunks_for(events, f"cmd{i}")
assert len(c) == 1
assert isinstance(c[0].chunk, TokenChunk)
assert c[0].chunk.text == f"word{i}"
assert c[0].chunk.finish_reason == "stop"
done = completed_task_ids(events)
assert len(done & {TaskId(f"t{i}") for i in range(10)}) == 10


@@ -1,13 +1,12 @@
# Check tasks are complete before runner is ever ready.
import unittest.mock
from collections.abc import Iterable
from typing import Callable
import mlx.core as mx
import pytest
import exo.worker.runner.runner as mlx_runner
from exo.shared.types.chunks import TokenChunk
from exo.shared.types.common import CommandId
from exo.shared.types.events import (
ChunkGenerated,
Event,
@@ -40,6 +39,7 @@ from exo.shared.types.worker.runners import (
RunnerWarmingUp,
)
from exo.utils.channels import mp_channel
from exo.worker.engines.mlx.generator.batch_engine import BatchedGenerationResponse
from ...constants import (
CHAT_COMPLETION_TASK_ID,
@@ -116,16 +116,7 @@ def patch_out_mlx(monkeypatch: pytest.MonkeyPatch):
monkeypatch.setattr(mlx_runner, "load_mlx_items", make_nothin((1, MockTokenizer)))
monkeypatch.setattr(mlx_runner, "warmup_inference", make_nothin(1))
monkeypatch.setattr(mlx_runner, "_check_for_debug_prompts", nothin)
monkeypatch.setattr(mlx_runner, "mx_any", make_nothin(False))
# Mock apply_chat_template since we're using a fake tokenizer (integer 1).
# Returns a prompt without a thinking tag so detect_thinking_prompt_suffix returns None.
monkeypatch.setattr(mlx_runner, "apply_chat_template", make_nothin("test prompt"))
monkeypatch.setattr(mlx_runner, "detect_thinking_prompt_suffix", make_nothin(False))
def fake_generate(*_1: object, **_2: object):
yield GenerationResponse(token=0, text="hi", finish_reason="stop", usage=None)
monkeypatch.setattr(mlx_runner, "mlx_generate", fake_generate)
monkeypatch.setattr(mlx_runner, "BatchGenerationEngine", FakeBatchEngine)
# Use a fake event_sender to remove test flakiness.
@@ -148,6 +139,7 @@ class MockTokenizer:
tool_call_start = None
tool_call_end = None
has_tool_calling = False
has_thinking = False
class MockGroup:
@@ -158,6 +150,70 @@ class MockGroup:
return 1
class FakeBatchEngine:
"""Fake batch engine that generates a single 'hi' token per request."""
def __init__(self, *_args: object, **_kwargs: object):
self._active_requests: dict[int, tuple[CommandId, TaskId]] = {}
self._pending_inserts: list[tuple[CommandId, TaskId, object]] = []
self._uid_counter = 0
self.rank = 0
def queue_request(
self, command_id: CommandId, task_id: TaskId, task_params: object
) -> str:
self._pending_inserts.append((command_id, task_id, task_params))
return ""
def sync_and_insert_pending(self) -> list[int]:
uids: list[int] = []
for cmd_id, task_id, _params in self._pending_inserts:
uid = self._uid_counter
self._uid_counter += 1
self._active_requests[uid] = (cmd_id, task_id)
uids.append(uid)
self._pending_inserts.clear()
return uids
def step(self) -> list[BatchedGenerationResponse]:
results: list[BatchedGenerationResponse] = []
for _uid, (cmd_id, task_id) in list(self._active_requests.items()):
results.append(
BatchedGenerationResponse(
command_id=cmd_id,
task_id=task_id,
response=GenerationResponse(
token=0, text="hi", finish_reason="stop", usage=None
),
)
)
self._active_requests.clear()
return results
def sync_completions(self) -> None:
pass
@property
def has_active_requests(self) -> bool:
return bool(self._active_requests)
@property
def has_pending_inserts(self) -> bool:
return bool(self._pending_inserts)
@property
def pending_insert_count(self) -> int:
return len(self._pending_inserts)
@property
def active_count(self) -> int:
return len(self._active_requests)
@property
def is_distributed(self) -> bool:
return False
def _run(tasks: Iterable[Task]):
bound_instance = get_bound_mlx_ring_instance(
instance_id=INSTANCE_1_ID,
@@ -167,7 +223,6 @@ def _run(tasks: Iterable[Task]):
)
task_sender, task_receiver = mp_channel[Task]()
_cancel_sender, cancel_receiver = mp_channel[TaskId]()
event_sender = EventCollector()
with task_sender:
@@ -178,16 +233,8 @@ def _run(tasks: Iterable[Task]):
# this is some c++ nonsense
task_receiver.close = nothin
task_receiver.join = nothin
with unittest.mock.patch(
"exo.worker.runner.runner.mx.distributed.all_gather",
make_nothin(mx.array([1])),
):
mlx_runner.main(
bound_instance,
event_sender, # pyright: ignore[reportArgumentType]
task_receiver,
cancel_receiver,
)
mlx_runner.main(bound_instance, event_sender, task_receiver) # type: ignore[arg-type]
return event_sender.events
@@ -232,17 +279,22 @@ def test_events_processed_in_correct_order(patch_out_mlx: pytest.MonkeyPatch):
TaskAcknowledged(task_id=WARMUP_TASK_ID),
TaskStatusUpdated(task_id=WARMUP_TASK_ID, task_status=TaskStatus.Complete),
RunnerStatusUpdated(runner_id=RUNNER_1_ID, runner_status=RunnerReady()),
# CHAT TASK: queued, tokens generated, then completed
TaskStatusUpdated(
task_id=CHAT_COMPLETION_TASK_ID, task_status=TaskStatus.Running
),
RunnerStatusUpdated(runner_id=RUNNER_1_ID, runner_status=RunnerRunning()),
TaskAcknowledged(task_id=CHAT_COMPLETION_TASK_ID),
RunnerStatusUpdated(
runner_id=RUNNER_1_ID,
runner_status=RunnerRunning(active_requests=1),
),
# Generation loop produces token and completes the task
expected_chunk,
TaskStatusUpdated(
task_id=CHAT_COMPLETION_TASK_ID, task_status=TaskStatus.Complete
),
# CHAT COMPLETION TASK SHOULD COMPLETE BEFORE RUNNER READY
RunnerStatusUpdated(runner_id=RUNNER_1_ID, runner_status=RunnerReady()),
# SHUTDOWN
TaskStatusUpdated(task_id=SHUTDOWN_TASK_ID, task_status=TaskStatus.Running),
RunnerStatusUpdated(
runner_id=RUNNER_1_ID, runner_status=RunnerShuttingDown()
@@ -251,7 +303,6 @@ def test_events_processed_in_correct_order(patch_out_mlx: pytest.MonkeyPatch):
TaskStatusUpdated(
task_id=SHUTDOWN_TASK_ID, task_status=TaskStatus.Complete
),
# SPECIAL EXCEPTION FOR RUNNER SHUTDOWN
RunnerStatusUpdated(runner_id=RUNNER_1_ID, runner_status=RunnerShutdown()),
],
)


@@ -5,13 +5,12 @@ from typing import Any
from exo.shared.types.worker.runner_response import GenerationResponse, ToolCallResponse
from exo.worker.runner.runner import parse_tool_calls
from exo.worker.runner.tool_parsers import make_mlx_parser
def _make_responses(
texts: list[str],
finish_on_last: bool = True,
) -> Generator[GenerationResponse]:
) -> Generator[GenerationResponse | ToolCallResponse]:
"""Create a sequence of GenerationResponses from text strings."""
for i, text in enumerate(texts):
is_last = i == len(texts) - 1
@@ -23,13 +22,10 @@ def _make_responses(
)
def _dummier_parser(text: str) -> dict[str, Any]:
def _dummy_parser(text: str) -> dict[str, Any]:
return {"name": "test_fn", "arguments": {"arg": text}}
_dummy_parser = make_mlx_parser("<tool_call>", "</tool_call>", _dummier_parser)
class TestParseToolCalls:
"""Tests for parse_tool_calls generator."""
@@ -39,6 +35,8 @@ class TestParseToolCalls:
results = list(
parse_tool_calls(
_make_responses(texts, finish_on_last=False),
"<tool_call>",
"</tool_call>",
_dummy_parser,
)
)
@@ -52,6 +50,8 @@ class TestParseToolCalls:
results = list(
parse_tool_calls(
_make_responses(texts),
"<tool_call>",
"</tool_call>",
_dummy_parser,
)
)
@@ -76,7 +76,9 @@ class TestParseToolCalls:
results = list(
parse_tool_calls(
_make_responses(texts, finish_on_last=False),
make_mlx_parser("<tool_call>", "</tool_call>", _failing_parser),
"<tool_call>",
"</tool_call>",
_failing_parser,
)
)