Compare commits

...

9 Commits

Author SHA1 Message Date
Evan
bfebf1f2fe maybe 2026-01-13 13:11:23 +00:00
Alex Cheema
e5e74e1eef Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility (#1125)
## Motivation

Upgrade mlx-lm to version 0.30.2, which requires transformers 5.0.0rc2 as
a prerelease dependency. This enables support for newer models like Kimi
K2 Thinking while maintaining compatibility with existing models.

The transformers 5.x release includes breaking changes that affect
custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility
fixes.

## Changes

### Core Changes
- **mlx-lm upgrade**: Bump to 0.30.2 with locked exact versions for
mlx/mlx-lm to prevent breaking changes
- **transformers 5.x compatibility**: Enable prerelease transformers
dependency

### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in
transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to
bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"`
to handle special tokens from chat templates

### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation

### Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template
encoding

## Why It Works

**bytes_to_unicode issue**: transformers 5.0.0rc2 moved
`bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to
`transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py`
imports from the old location. The monkey-patch restores it at module
load time.
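
In code, the shim is a small module-level patch at the top of `utils_mlx.py` (a sketch mirroring the diff below; it is a no-op on transformers < 5.0):

```python
try:
    import transformers.models.gpt2.tokenization_gpt2 as gpt2_tokenization
    from transformers.convert_slow_tokenizer import bytes_to_unicode

    # Restore the name at its old import location so Kimi's tokenization_kimi.py
    # keeps working unmodified under transformers 5.x.
    if not hasattr(gpt2_tokenization, "bytes_to_unicode"):
        gpt2_tokenization.bytes_to_unicode = bytes_to_unicode  # type: ignore[attr-defined]
except ImportError:
    pass  # transformers < 5.0 still exposes bytes_to_unicode at the old location
```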

**AutoTokenizer issue**: transformers 5.x has a bug where
`tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for
custom tokenizers with `auto_map`. Loading the tokenizer directly
bypasses this.

**encode() issue**: transformers 5.x's `pad()` method fails for slow
tokenizers. Using tiktoken's encode directly with
`allowed_special="all"` avoids this path and properly handles special
tokens like `<|im_user|>` from chat templates.
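
Condensed, the Kimi-specific load path from this PR looks like the sketch below (the helper name is illustrative; the real logic lives in `load_tokenizer_for_model_id` in `utils_mlx.py`, shown in the diff further down):

```python
import sys
from pathlib import Path
from typing import Any

from mlx_lm.tokenizer_utils import TokenizerWrapper


def load_kimi_tokenizer(model_path: Path) -> TokenizerWrapper:
    # Import the custom tokenizer directly instead of going through AutoTokenizer,
    # sidestepping the transformers 5.x auto_map fallback bug.
    sys.path.insert(0, str(model_path))
    from tokenization_kimi import TikTokenTokenizer  # type: ignore[import-not-found]

    hf_tokenizer: Any = TikTokenTokenizer.from_pretrained(model_path)

    # Bypass transformers 5.x's encode()->pad() path by calling tiktoken directly;
    # allowed_special="all" lets chat-template tokens such as <|im_user|> through.
    def _patched_encode(text: str, **_kwargs: object) -> list[int]:
        return list(hf_tokenizer.model.encode(text, allowed_special="all"))

    hf_tokenizer.encode = _patched_encode
    return TokenizerWrapper(hf_tokenizer, eos_token_ids=[163586])
```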

## Test Plan

### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and
james21)
- Tested Kimi K2 Thinking, GPT-OSS-120B, GPT-OSS-20B, Llama-3.1-8B-bf16, and Qwen3-30B-A3B-8bit models with pipeline parallelism across both nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens

### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
  - Basic encode/decode for all model families (deepseek, kimi, llama, qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`

### Failing Tests
RDMA fails with all models.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-01-13 12:06:04 +00:00
Jake Hillion
b968d6f0a0 ci: remove old commented out job 2026-01-13 12:42:04 +01:00
Jake Hillion
3bfffd9b4f ci: build all Nix outputs on all platforms and push to cachix
The CI was only running `nix flake check` on ubuntu-latest, missing
builds for other platforms and not caching packages or devShells.

Added a matrix-based `nix-build` job that runs on macos-26 (aarch64-darwin),
ubuntu-latest (x86_64-linux), and ubuntu-24.04-arm (aarch64-linux). Each
job enumerates all packages and devShells via `nix flake show --json`,
builds them in a single `nix build` call for parallelization, then runs
`nix flake check`. The cachix-action pushes all built outputs automatically.

This ensures all Nix outputs are built and cached for every supported
platform, speeding up local development and CI runs.

Test plan:
- Tested jq enumeration command locally, correctly outputs devShell paths
- Verified xargs pipeline works with the enumerated outputs
2026-01-13 12:37:12 +01:00
Jake Hillion
007eb80029 nix: enable cachix
Enable cachix and push to it in the pipeline.yml workflow. This won't
cache a huge amount yet but will automatically extend our caching as we
build more of the repo with Nix in CI. It can also be used by local
users by accepting our cache to improve the speed of local builds.

Test plan:
- CI
2026-01-12 17:24:59 +01:00
Jake Hillion
8d7b6789b3 dashboard: show disk usage for completed models
The downloads dashboard showed "Completed" for finished model downloads
but gave no indication of how much disk space each model, or all models
on a node combined, was using.

Added total_bytes field to DownloadCompleted type so the size is
preserved when a download completes. Updated the dashboard to display
the model size next to "Completed" status (e.g., "Completed (251.1GB)")
and a total disk usage line below the model count for each node (e.g.,
"502.2GB on disk").

Test plan:
- Ran unit tests for download apply and planning logic
- Type checked all modified files with basedpyright
2026-01-12 16:34:29 +01:00
Jake Hillion
3c5b7ea670 ci: add workflow_dispatch trigger to build-app
Build app is the most convenient way to get a DMG for testing, but
currently it's a bit limited. You have to push to test-app every time
which is far from ideal and requires a bit too much force pushing for my
liking.

Add the workflow_dispatch trigger. This adds a button in the actions UI
to trigger a workflow for a named branch, which means you can use your
normal dev branch instead of having to push to test-app. We'll leave
that behaviour there for now too, though it may change in future.

Filter on `"${{ github.event_name }}" == "workflow_dispatch"` and set
those to alpha as well. Will verify by pushing the first version from
`main` just in case. Unfortunately we do have to merge this before we
can test it.

Test plan:
- Looking really hard.
2026-01-12 12:14:21 +01:00
PG
b74a610537 Add basic documentation for the API interface (#1122)
## Motivation

Adds basic API documentation.

## Changes

- Add docs/api.md
- Modify README.md
2026-01-11 18:44:40 +00:00
Jake Hillion
18c4e49f91 nix: put treefmt in devshell
treefmt is useful to have directly accessible for some formatters like
`jj fix`. Expose it in the devshell.

Test plan:
- Used with `jj fix` on a large branch. It worked.
2026-01-09 17:53:50 +01:00
21 changed files with 2103 additions and 969 deletions

View File

@@ -1,6 +1,7 @@
name: Build EXO macOS DMG
on:
workflow_dispatch:
push:
tags:
- "v*"
@@ -35,7 +36,7 @@ jobs:
- name: Derive release version from tag
run: |
if [[ "$GITHUB_REF_NAME" == "test-app" ]]; then
if [[ "$GITHUB_REF_NAME" == "test-app" || "${{ github.event_name }}" == "workflow_dispatch" ]]; then
VERSION="0.0.0-alpha.0"
echo "IS_ALPHA=true" >> $GITHUB_ENV
else

View File

@@ -20,6 +20,12 @@ jobs:
with:
nix_path: nixpkgs=channel:nixos-unstable
- uses: cachix/cachix-action@v14
name: Configure Cachix
with:
name: exo
authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
- name: Configure git user
run: |
git config --local user.email "github-actions@users.noreply.github.com"
@@ -88,9 +94,19 @@ jobs:
- uses: ./.github/actions/typecheck
nix-flake-check:
name: Check Nix flake
runs-on: ubuntu-latest
nix:
name: Build and check (${{ matrix.system }})
runs-on: ${{ matrix.runner }}
strategy:
fail-fast: false
matrix:
include:
- runner: macos-26
system: aarch64-darwin
- runner: ubuntu-latest
system: x86_64-linux
- runner: ubuntu-24.04-arm
system: aarch64-linux
steps:
- name: Checkout repository
uses: actions/checkout@v4
@@ -101,83 +117,20 @@ jobs:
with:
nix_path: nixpkgs=channel:nixos-unstable
- name: Run nix flake check
run: |
nix flake check
shell: bash
- uses: cachix/cachix-action@v14
name: Configure Cachix
with:
name: exo
authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
# ci:
# needs: typecheck
# runs-on: ubuntu-latest
# permissions:
# contents: read
# env:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# steps:
# - name: Checkout repository
# uses: actions/checkout@v4
# with:
# fetch-depth: 0
# token: ${{ secrets.GITHUB_TOKEN }}
# lfs: true
#
# - name: Configure git user
# run: |
# git config --local user.email "github-actions@users.noreply.github.com"
# git config --local user.name "github-actions bot"
# shell: bash
#
# - name: Pull LFS files
# run: |
# echo "Pulling Git LFS files..."
# git lfs pull
# shell: bash
#
# - name: Setup EXO_HOME and API_PORT
# run: |
# EXO_HOME=$(mktemp -d -t exo-ci-XXXXXXXX)
# # Generate random port (macOS compatible method)
# API_PORT=$((49152 + RANDOM % (65535 - 49152 + 1)))
# echo "EXO_HOME=$EXO_HOME" >> $GITHUB_ENV
# echo "API_PORT=$API_PORT" >> $GITHUB_ENV
# echo "Created EXO_HOME: $EXO_HOME"
# echo "Generated API_PORT: $API_PORT"
# shell: bash
#
# - name: Setup Nix Environment
# run: |
# echo "Checking for nix installation..."
#
# # Check if nix binary exists directly
# if [ -f /nix/var/nix/profiles/default/bin/nix ]; then
# echo "Found nix binary at /nix/var/nix/profiles/default/bin/nix"
# export PATH="/nix/var/nix/profiles/default/bin:$PATH"
# echo "PATH=$PATH" >> $GITHUB_ENV
# nix --version
# elif [ -f /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh ]; then
# echo "Found nix profile script, sourcing..."
# source /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh
# nix --version
# elif command -v nix >/dev/null 2>&1; then
# echo "Nix already in PATH"
# nix --version
# else
# echo "Nix not found. Debugging info:"
# echo "Contents of /nix/var/nix/profiles/default/:"
# ls -la /nix/var/nix/profiles/default/ 2>/dev/null || echo "Directory not found"
# echo "Contents of /nix/var/nix/profiles/default/bin/:"
# ls -la /nix/var/nix/profiles/default/bin/ 2>/dev/null || echo "Directory not found"
# exit 1
# fi
# shell: bash
#
# - uses: ./.github/actions/lint-check
#
# - uses: ./.github/actions/unit-test
#
# - name: Cleanup EXO_HOME
# run: |
# echo "Cleaning up EXO_HOME: $EXO_HOME"
# rm -rf "$EXO_HOME"
# shell: bash
# if: always()
- name: Build all Nix outputs
run: |
nix flake show --json | jq -r '
[
(.packages."${{ matrix.system }}" // {} | keys[] | ".#packages.${{ matrix.system }}.\(.)"),
(.devShells."${{ matrix.system }}" // {} | keys[] | ".#devShells.${{ matrix.system }}.\(.)")
] | .[]
' | xargs nix build
- name: Run nix flake check
run: nix flake check

View File

@@ -0,0 +1,156 @@
"""Type stubs for mlx_lm.models.deepseek_v3"""
from dataclasses import dataclass
from typing import Any, Dict, Optional
import mlx.core as mx
import mlx.nn as nn
from .base import BaseModelArgs
from .switch_layers import SwitchGLU
@dataclass
class ModelArgs(BaseModelArgs):
model_type: str
vocab_size: int
hidden_size: int
intermediate_size: int
moe_intermediate_size: int
num_hidden_layers: int
num_attention_heads: int
num_key_value_heads: int
n_shared_experts: Optional[int]
n_routed_experts: Optional[int]
routed_scaling_factor: float
kv_lora_rank: int
q_lora_rank: Optional[int]
qk_rope_head_dim: int
v_head_dim: int
qk_nope_head_dim: int
topk_method: str
scoring_func: str
norm_topk_prob: bool
n_group: int
topk_group: int
num_experts_per_tok: int
moe_layer_freq: int
first_k_dense_replace: int
max_position_embeddings: int
rms_norm_eps: float
rope_theta: float
rope_scaling: Optional[Dict[str, Any]]
attention_bias: bool
class DeepseekV3Attention(nn.Module):
config: ModelArgs
hidden_size: int
num_heads: int
max_position_embeddings: int
rope_theta: float
q_lora_rank: Optional[int]
qk_rope_head_dim: int
kv_lora_rank: int
v_head_dim: int
qk_nope_head_dim: int
q_head_dim: int
scale: float
q_proj: nn.Linear
q_a_proj: nn.Linear
q_a_layernorm: nn.RMSNorm
q_b_proj: nn.Linear
kv_a_proj_with_mqa: nn.Linear
kv_a_layernorm: nn.RMSNorm
kv_b_proj: nn.Linear
o_proj: nn.Linear
rope: Any
def __init__(self, config: ModelArgs) -> None: ...
def __call__(
self,
x: mx.array,
mask: Optional[mx.array] = None,
cache: Optional[Any] = None,
) -> mx.array: ...
class DeepseekV3MLP(nn.Module):
config: ModelArgs
hidden_size: int
intermediate_size: int
gate_proj: nn.Linear
up_proj: nn.Linear
down_proj: nn.Linear
def __init__(
self,
config: ModelArgs,
hidden_size: Optional[int] = None,
intermediate_size: Optional[int] = None,
) -> None: ...
def __call__(self, x: mx.array) -> mx.array: ...
class MoEGate(nn.Module):
config: ModelArgs
top_k: int
norm_topk_prob: bool
n_routed_experts: Optional[int]
routed_scaling_factor: float
n_group: int
topk_group: int
weight: mx.array
e_score_correction_bias: mx.array
def __init__(self, config: ModelArgs) -> None: ...
def __call__(self, x: mx.array) -> tuple[mx.array, mx.array]: ...
class DeepseekV3MoE(nn.Module):
config: ModelArgs
num_experts_per_tok: int
switch_mlp: SwitchGLU
gate: MoEGate
shared_experts: DeepseekV3MLP
sharding_group: Optional[mx.distributed.Group]
def __init__(self, config: ModelArgs) -> None: ...
def __call__(self, x: mx.array) -> mx.array: ...
class DeepseekV3DecoderLayer(nn.Module):
self_attn: DeepseekV3Attention
mlp: DeepseekV3MLP | DeepseekV3MoE
input_layernorm: nn.RMSNorm
post_attention_layernorm: nn.RMSNorm
def __init__(self, config: ModelArgs, layer_idx: int) -> None: ...
def __call__(
self,
x: mx.array,
mask: Optional[mx.array] = None,
cache: Optional[Any] = None,
) -> mx.array: ...
class DeepseekV3Model(nn.Module):
vocab_size: int
embed_tokens: nn.Embedding
layers: list[DeepseekV3DecoderLayer]
norm: nn.RMSNorm
def __init__(self, config: ModelArgs) -> None: ...
def __call__(
self,
x: mx.array,
cache: Optional[Any] = None,
) -> mx.array: ...
class Model(nn.Module):
model_type: str
model: DeepseekV3Model
lm_head: nn.Linear
def __init__(self, config: ModelArgs) -> None: ...
def __call__(
self,
inputs: mx.array,
cache: Optional[Any] = None,
) -> mx.array: ...
def sanitize(self, weights: dict[str, Any]) -> dict[str, Any]: ...
@property
def layers(self) -> list[DeepseekV3DecoderLayer]: ...

View File

@@ -57,6 +57,11 @@ class SwiGLU(nn.Module):
def __call__(self, x, gate): ...
class SwitchGLU(nn.Module):
gate_proj: SwitchLinear
up_proj: SwitchLinear
down_proj: SwitchLinear
activation: SwiGLU
def __init__(
self,
input_dims: int,

View File

@@ -4,6 +4,7 @@ This type stub file was generated by pyright.
from functools import partial
from pathlib import Path
from typing import Any
from transformers import PreTrainedTokenizerFast
@@ -103,37 +104,55 @@ class TokenizerWrapper:
Accessing any attribute other than the ``detokenizer`` is forwarded to the
huggingface tokenizer.
"""
def __init__(self, tokenizer, detokenizer_class=..., eos_token_ids=...) -> None: ...
def add_eos_token(self, token: str): # -> None:
...
@property
def has_thinking(self): # -> bool:
...
@property
def think_start(self): # -> str | None:
...
@property
def think_end(self): # -> str | None:
...
@property
def has_tool_calling(self): # -> bool:
...
@property
def tool_call_start(self): # -> str | None:
...
@property
def tool_call_end(self): # -> str | None:
...
@property
def detokenizer(self): # -> NaiveStreamingDetokenizer:
"""
Get a stateful streaming detokenizer.
"""
def __getattr__(self, attr): # -> set[Any] | Any:
...
def __setattr__(self, attr, value): # -> None:
...
_tokenizer: PreTrainedTokenizerFast
eos_token_id: int | None
eos_token: str | None
bos_token_id: int | None
bos_token: str | None
vocab_size: int
all_special_tokens: list[str]
def __init__(
self,
tokenizer: Any,
detokenizer_class: Any = ...,
eos_token_ids: list[int] | None = ...,
chat_template: Any = ...,
tool_parser: Any = ...,
tool_call_start: str | None = ...,
tool_call_end: str | None = ...,
) -> None: ...
def encode(self, text: str, **kwargs: Any) -> list[int]: ...
def decode(self, token_ids: list[int], **kwargs: Any) -> str: ...
def apply_chat_template(
self,
messages: list[dict[str, Any]],
tokenize: bool = False,
add_generation_prompt: bool = False,
tools: Any = None,
**kwargs: Any,
) -> str: ...
def get_vocab(self) -> dict[str, int]: ...
def add_eos_token(self, token: str) -> None: ...
@property
def has_thinking(self) -> bool: ...
@property
def think_start(self) -> str | None: ...
@property
def think_end(self) -> str | None: ...
@property
def has_tool_calling(self) -> bool: ...
@property
def tool_call_start(self) -> str | None: ...
@property
def tool_call_end(self) -> str | None: ...
@property
def detokenizer(self) -> NaiveStreamingDetokenizer:
"""Get a stateful streaming detokenizer."""
def __getattr__(self, attr: str) -> Any: ...
def __setattr__(self, attr: str, value: Any) -> None: ...
class NewlineTokenizer(PreTrainedTokenizerFast):
"""A tokenizer that replaces newlines with <n> and <n> with new line."""
@@ -146,18 +165,11 @@ class NewlineTokenizer(PreTrainedTokenizerFast):
def batch_decode(self, *args, **kwargs): # -> list[str]:
...
def load_tokenizer(
def load(
model_path: Path,
tokenizer_config_extra=...,
return_tokenizer=...,
eos_token_ids=...,
) -> (
TokenizerWrapper
| type[SPMStreamingDetokenizer]
| partial[SPMStreamingDetokenizer]
| type[BPEStreamingDetokenizer]
| type[NaiveStreamingDetokenizer]
):
tokenizer_config_extra: dict[str, Any] | None = None,
eos_token_ids: list[int] | int | None = None,
) -> TokenizerWrapper:
"""Load a huggingface tokenizer and try to infer the type of streaming
detokenizer to use.
@@ -165,4 +177,7 @@ def load_tokenizer(
a Hugging Face repo ID.
"""
def no_bos_or_eos(sequence: list, bos: int, eos: int) -> list: ...
# Alias for backward compatibility
load_tokenizer = load
def no_bos_or_eos(sequence: list[int], bos: int, eos: int) -> list[int]: ...

MISSED_THINGS.md (new file, 41 lines)
View File

@@ -0,0 +1,41 @@
# Missed things
[X] Log EXO_LIBP2P_NAMESPACE on start in exo/main.py
[X] Ordering of warmup was changed, which is wrong. It was changed to rank < n-1, then rank=n-1. It should be rank!=0 then rank=0 (this matches the auto_parallel implementation. NOTE: we use a different convention to mlx-lm; our terminal rank is rank=n-1 whereas mlx-lm's is rank=0, hence I can see why this was changed wrongly).
[X] Downloads keying by model_id not shard_metadata (worker/plan.py, worker/main.py).
[X] Fetching download status of all models on start
[X] Deduplication of tasks in plan_step.
[X] resolve_allow_patterns should just be wildcard now.
[] no mx_barrier in generate.py mlx_generate at the end.
[] cache assertion not needed in auto_parallel.py PipelineLastLayer.
[] GPTOSS support dropped in auto_parallel.py.
[] sharding changed "all-to-sharded" became _all_to_sharded in auto_parallel.py.
[] same as above with "sharded-to-all" became _sharded_to_all in auto_parallel.py.
[] Dropped support for Ministral3Model, DeepseekV32Model, Glm4MoeModel, Qwen3NextModel, GptOssModel in auto_parallel.py.
[] Dropped prefill/decode code in auto_parallel.py and utils_mlx.py.
[X] KV_CACHE_BITS should be None to disable quantized KV cache.
[] Dropped _set_nofile_limit in utils_mlx.py.
[] We have group optional in load_mlx_items in utils_mlx.py.
[] Dropped add_missing_chat_templates for GptOss in load_mlx_items in utils_mlx.py.
[] Dropped model.make_cache in make_kv_cache in utils_mlx.py.
[X] We put cache limit back in utils_mlx.py.
[] topology.py remove_node removes the connections after checking if the node is in self._node_id_to_rx_id_map. On beta_1 it checks after, so it would remove stale connections, I guess?
[] Missing Glm 4.7 model cards (this isn't ready yet but should be picked up, probably create an issue... the blocker is that the transformers version doesn't support the tokenizer for Glm 4.7; rc-1 does but we can't upgrade as it breaks other things.)
[] try-except in _command_processor only catches ValueError. This was silently failing, leading to un-debuggable errors (we had a KeyError that was happening). Changed this to catch Exception instead of ValueError. See exo-v2 89ae38405e0052e3c22405daf094b065878aa873 and fb99fea69b5a39017efc90c5dad0072e677455f0.
[X] In placement.py, place_instance no longer looks at model_meta.supports_tensor and checks whether this tensor parallel number of nodes is supported by the model's tensor dimensions.
[X] In placement.py, place_instance, we no longer have the special case to exclude DeepSeek v3.1 pipeline parallel (it doesn't work).
[] logger.warning("You have likely selected ibv for a single node instance; falling back to MlxRing") was changed to debug. That will spam this warning since it happens every time we query instance previews.
[X] In placement_utils.py, get_mlx_jaccl_coordinators, we no longer prioritise the Jaccl Coordinator IP. Now it picks the first one, which is unstable (Jaccl coordinator over TB5 is unstable).

View File

@@ -305,7 +305,10 @@ curl -X DELETE http://localhost:52415/instance/YOUR_INSTANCE_ID
- List all models: `curl http://localhost:52415/models`
- Inspect instance IDs and deployment state: `curl http://localhost:52415/state`
For further details, see API types and endpoints in [src/exo/master/api.py](src/exo/master/api.py).
For further details, see:
- API basic documentation in [docs/api.md](docs/api.md).
- API types and endpoints in [src/exo/master/api.py](src/exo/master/api.py).
---

View File

@@ -199,7 +199,13 @@
const rawProgress = (downloadPayload as Record<string, unknown>).download_progress
?? (downloadPayload as Record<string, unknown>).downloadProgress
?? {};
const totalBytes = getBytes((rawProgress as Record<string, unknown>).total_bytes ?? (rawProgress as Record<string, unknown>).totalBytes);
// For DownloadCompleted, total_bytes is at top level; for DownloadOngoing, it's inside download_progress
const totalBytes = getBytes(
(downloadPayload as Record<string, unknown>).total_bytes
?? (downloadPayload as Record<string, unknown>).totalBytes
?? (rawProgress as Record<string, unknown>).total_bytes
?? (rawProgress as Record<string, unknown>).totalBytes
);
const downloadedBytes = getBytes((rawProgress as Record<string, unknown>).downloaded_bytes ?? (rawProgress as Record<string, unknown>).downloadedBytes);
const speed = (rawProgress as Record<string, unknown>).speed as number ?? 0;
const etaMs = (rawProgress as Record<string, unknown>).eta_ms as number ?? (rawProgress as Record<string, unknown>).etaMs as number ?? 0;
@@ -332,8 +338,13 @@
<div class="text-lg font-mono text-white truncate">{node.nodeName}</div>
<div class="text-xs text-exo-light-gray font-mono truncate">{node.nodeId}</div>
</div>
<div class="text-xs font-mono uppercase tracking-wider whitespace-nowrap shrink-0">
<span class="text-green-400">{node.models.filter(m => m.status === 'completed').length}</span><span class="text-exo-yellow"> /{node.models.length} models</span>
<div class="text-xs font-mono uppercase tracking-wider whitespace-nowrap shrink-0 text-right">
<div>
<span class="text-green-400">{node.models.filter(m => m.status === 'completed').length}</span><span class="text-exo-yellow"> / {node.models.length} models</span>
</div>
<div class="text-exo-light-gray normal-case tracking-normal">
{formatBytes(node.models.filter(m => m.status === 'completed').reduce((sum, m) => sum + m.totalBytes, 0))} on disk
</div>
</div>
</div>
@@ -385,7 +396,7 @@
</div>
<div class="flex items-center justify-between text-xs font-mono text-exo-light-gray">
<span>{model.status === 'completed' ? 'Completed' : `${formatSpeed(model.speed)} ETA ${formatEta(model.etaMs)}`}</span>
<span>{model.status === 'completed' ? `Completed (${formatBytes(model.totalBytes)})` : `${formatSpeed(model.speed)} ETA ${formatEta(model.etaMs)}`}</span>
{#if model.status !== 'completed'}
<span>{model.files.length} file{model.files.length === 1 ? '' : 's'}</span>
{/if}

docs/api.md (new file, 212 lines)
View File

@@ -0,0 +1,212 @@
# EXO API Technical Reference
This document describes the REST API exposed by the **EXO** service, as implemented in:
`src/exo/master/api.py`
The API is used to manage model instances in the cluster, inspect cluster state, and perform inference using an OpenAI-compatible interface.
Base URL example:
```
http://localhost:52415
```
## 1. General / Meta Endpoints
### Get Master Node ID
**GET** `/node_id`
Returns the identifier of the current master node.
**Response (example):**
```json
{
"node_id": "node-1234"
}
```
### Get Cluster State
**GET** `/state`
Returns the current state of the cluster, including nodes and active instances.
**Response:**
JSON object describing topology, nodes, and instances.
### Get Events
**GET** `/events`
Returns the list of internal events recorded by the master (mainly for debugging and observability).
**Response:**
Array of event objects.
## 2. Model Instance Management
### Create Instance
**POST** `/instance`
Creates a new model instance in the cluster.
**Request body (example):**
```json
{
"instance": {
"model_id": "llama-3.2-1b",
"placement": { }
}
}
```
**Response:**
JSON description of the created instance.
### Delete Instance
**DELETE** `/instance/{instance_id}`
Deletes an existing instance by ID.
**Path parameters:**
* `instance_id`: string, ID of the instance to delete
**Response:**
Status / confirmation JSON.
### Get Instance
**GET** `/instance/{instance_id}`
Returns details of a specific instance.
**Path parameters:**
* `instance_id`: string
**Response:**
JSON description of the instance.
### Preview Placements
**GET** `/instance/previews?model_id=...`
Returns possible placement previews for a given model.
**Query parameters:**
* `model_id`: string, required
**Response:**
Array of placement preview objects.
### Compute Placement
**GET** `/instance/placement`
Computes a placement for a potential instance without creating it.
**Query parameters (typical):**
* `model_id`: string
* `sharding`: string or config
* `instance_meta`: JSON-encoded metadata
* `min_nodes`: integer
**Response:**
JSON object describing the proposed placement / instance configuration.
### Place Instance (Dry Operation)
**POST** `/place_instance`
Performs a placement operation for an instance (planning step), without necessarily creating it.
**Request body:**
JSON describing the instance to be placed.
**Response:**
Placement result.
## 3. Models
### List Models
**GET** `/models`
**GET** `/v1/models` (alias)
Returns the list of available models and their metadata.
**Response:**
Array of model descriptors.
## 4. Inference / Chat Completions
### OpenAI-Compatible Chat Completions
**POST** `/v1/chat/completions`
Executes a chat completion request using an OpenAI-compatible schema. Supports streaming and non-streaming modes.
**Request body (example):**
```json
{
"model": "llama-3.2-1b",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello" }
],
"stream": false
}
```
**Response:**
OpenAI-compatible chat completion response.
### Benchmarked Chat Completions
**POST** `/bench/chat/completions`
Same as `/v1/chat/completions`, but also returns performance and generation statistics.
**Request body:**
Same schema as `/v1/chat/completions`.
**Response:**
Chat completion plus benchmarking metrics.
## 5. Complete Endpoint Summary
```
GET /node_id
GET /state
GET /events
POST /instance
GET /instance/{instance_id}
DELETE /instance/{instance_id}
GET /instance/previews
GET /instance/placement
POST /place_instance
GET /models
GET /v1/models
POST /v1/chat/completions
POST /bench/chat/completions
```
## 6. Notes
* The `/v1/chat/completions` endpoint is compatible with the OpenAI API format, so existing OpenAI clients can be pointed to EXO by changing the base URL.
* The instance placement endpoints allow you to plan and preview cluster allocations before actually creating instances.
* The `/events` and `/state` endpoints are primarily intended for operational visibility and debugging.
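
As a quick illustration of the first note, a minimal sketch using the official OpenAI Python client (the model ID reuses the example from section 4; the API key is a placeholder, since the endpoints documented here do not mention authentication):

```python
from openai import OpenAI

# Point the client at a local EXO master instead of api.openai.com.
client = OpenAI(base_url="http://localhost:52415/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.2-1b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    stream=False,
)
print(response.choices[0].message.content)
```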

View File

@@ -16,12 +16,11 @@
};
};
# TODO: figure out caching story
# nixConfig = {
# # nix community cachix
# extra-trusted-public-keys = "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs=";
# extra-substituters = "https://nix-community.cachix.org";
# };
nixConfig = {
# nix community cachix
extra-trusted-public-keys = "exo.cachix.org-1:okq7hl624TBeAR3kV+g39dUFSiaZgLRkLsFBCuJ2NZI=";
extra-substituters = "https://exo.cachix.org";
};
outputs =
inputs:
@@ -73,6 +72,9 @@
packages =
with pkgs;
[
# FORMATTING
treefmtEval.config.build.wrapper
# PYTHON
python313
uv

View File

@@ -17,9 +17,9 @@ dependencies = [
"loguru>=0.7.3",
"exo_pyo3_bindings", # rust bindings
"anyio==4.11.0",
"mlx>=0.30.1; sys_platform == 'darwin'",
"mlx[cpu]>=0.30.1; sys_platform == 'linux'",
"mlx-lm>=0.28.3",
"mlx==0.30.1; sys_platform == 'darwin'",
"mlx[cpu]==0.30.1; sys_platform == 'linux'",
"mlx-lm @ git+https://github.com/AlexCheema/mlx-lm.git@fix-transformers-5.0.0rc2",
"tiktoken>=0.12.0", # required for kimi k2 tokenizer
"hypercorn>=0.18.0",
"openai-harmony>=0.0.8",
@@ -33,6 +33,7 @@ exo = "exo.main:main"
# dependencies only required for development
[dependency-groups]
dev = [
"basedpyright>=1.29.0",
"pyinstaller>=6.17.0",
"pytest>=8.4.0",
"pytest-asyncio>=1.0.0",
@@ -98,6 +99,7 @@ root = "src"
# supported platforms for this project
[tool.uv]
prerelease = "allow"
environments = [
"sys_platform == 'darwin'",
"sys_platform == 'linux'",

View File

@@ -2,6 +2,7 @@ from exo.shared.apply import apply_node_download_progress
from exo.shared.tests.conftest import get_pipeline_shard_metadata
from exo.shared.types.common import NodeId
from exo.shared.types.events import NodeDownloadProgress
from exo.shared.types.memory import Memory
from exo.shared.types.state import State
from exo.shared.types.worker.downloads import DownloadCompleted
from exo.worker.tests.constants import MODEL_A_ID, MODEL_B_ID
@@ -13,6 +14,7 @@ def test_apply_node_download_progress():
event = DownloadCompleted(
node_id=NodeId("node-1"),
shard_metadata=shard1,
total_bytes=Memory(),
)
new_state = apply_node_download_progress(
@@ -28,10 +30,12 @@ def test_apply_two_node_download_progress():
event1 = DownloadCompleted(
node_id=NodeId("node-1"),
shard_metadata=shard1,
total_bytes=Memory(),
)
event2 = DownloadCompleted(
node_id=NodeId("node-1"),
shard_metadata=shard2,
total_bytes=Memory(),
)
state = State(downloads={NodeId("node-1"): [event1]})

View File

@@ -28,7 +28,7 @@ class DownloadPending(BaseDownloadProgress):
class DownloadCompleted(BaseDownloadProgress):
pass
total_bytes: Memory
class DownloadFailed(BaseDownloadProgress):

View File

@@ -10,18 +10,23 @@ from mlx.nn.layers.distributed import (
shard_linear,
sum_gradients,
)
from mlx_lm.models.cache import (
_BaseCache, # pyright: ignore[reportPrivateUsage]
)
from mlx_lm.models.deepseek_v3 import DeepseekV3MLP
from mlx_lm.models.deepseek_v3 import Model as DeepseekV3Model
from mlx_lm.models.deepseek_v32 import DeepseekV32MLP
from mlx_lm.models.deepseek_v32 import Model as DeepseekV32Model
from mlx_lm.models.ministral3 import Model as Ministral3Model
from mlx_lm.models.gpt_oss import GptOssMoeModel
from mlx_lm.models.gpt_oss import Model as GptOssModel
from mlx_lm.models.llama import Model as LlamaModel
from mlx_lm.models.qwen3_moe import Model as Qwen3MoeModel
from mlx_lm.models.qwen3_moe import Qwen3MoeSparseMoeBlock
from mlx_lm.models.qwen3_next import Model as Qwen3NextModel
from mlx_lm.models.qwen3_next import Qwen3NextSparseMoeBlock
from mlx_lm.models.glm4_moe import Model as Glm4MoeModel
from mlx_lm.models.glm4_moe import MoE
from exo.shared.types.worker.shards import (
PipelineShardMetadata,
)
from exo.shared.logging import logger
from exo.shared.types.worker.shards import PipelineShardMetadata
class _LayerCallable(Protocol):
@@ -69,6 +74,7 @@ class PipelineFirstLayer(CustomMlxLayer):
def __call__(self, x: mx.array, *args: object, **kwargs: object) -> mx.array:
if self.r != 0:
x = mx.distributed.recv_like(x, (self.r - 1), group=self.group)
# mx.eval(x)
return self.original_layer(x, *args, **kwargs)
@@ -91,8 +97,6 @@ class PipelineLastLayer(CustomMlxLayer):
x, *args, **kwargs
).arguments.get("cache", None)
assert cache is None or issubclass(type(cache), _BaseCache) # type: ignore
output: mx.array = self.original_layer(x, *args, **kwargs)
if self.r != self.s - 1:
@@ -100,7 +104,6 @@ class PipelineLastLayer(CustomMlxLayer):
output, (self.r + 1) % self.s, group=self.group
)
if cache is not None:
# This change happened upstream - check out mlx github somewhere??
cache.keys = mx.depends(cache.keys, output) # type: ignore[reportUnknownMemberType]
output = mx.distributed.all_gather(output, group=self.group)[-output.shape[0] :]
@@ -132,24 +135,6 @@ def _get_layers(inner_model_instance: nn.Module) -> list[_LayerCallable]:
return layers
def _set_layers(model: nn.Module, layers: list[_LayerCallable]) -> None:
inner_model_instance = _inner_model(model)
if hasattr(inner_model_instance, "layers"):
inner_model_instance.layers = layers
# Update DeepSeek V3 specific parameters when layers are shrunk
if isinstance(model, DeepseekV3Model) and hasattr(
inner_model_instance, "num_layers"
):
inner_model_instance.start_idx = 0
inner_model_instance.end_idx = len(layers)
inner_model_instance.num_layers = len(layers)
elif hasattr(inner_model_instance, "h"):
inner_model_instance.h = layers
else:
raise ValueError("Model must have either a 'layers' or 'h' attribute")
def pipeline_auto_parallel(
model: nn.Module,
group: mx.distributed.Group,
@@ -165,8 +150,7 @@ def pipeline_auto_parallel(
"""
inner_model_instance: nn.Module = _inner_model(model)
# Handle both model.layers and model.h cases
layers: list[_LayerCallable] = _get_layers(inner_model_instance)
layers = _get_layers(inner_model_instance)
start_layer, end_layer = model_shard_meta.start_layer, model_shard_meta.end_layer
device_rank, world_size = model_shard_meta.device_rank, model_shard_meta.world_size
@@ -180,6 +164,17 @@ def pipeline_auto_parallel(
group=group,
)
if isinstance(inner_model_instance, GptOssMoeModel):
inner_model_instance.layer_types = inner_model_instance.layer_types[ # type: ignore
start_layer:end_layer
]
inner_model_instance.swa_idx = inner_model_instance.layer_types.index( # type: ignore
"sliding_attention"
)
inner_model_instance.ga_idx = inner_model_instance.layer_types.index( # type: ignore
"full_attention"
)
_set_layers(model, layers)
assert isinstance(layers, list), (
@@ -204,18 +199,44 @@ def tensor_auto_parallel(
group=group,
)
SEGMENTS: int = 1
def _all_to_sharded(path: str, weight: mx.array):
if path.endswith("bias"):
logger.info(f"Sharding bias for {path} - all to sharded")
return weight.ndim - 1, SEGMENTS
return max(weight.ndim - 2, 0), SEGMENTS
all_to_sharded_linear_in_place = partial(
shard_inplace,
sharding="all-to-sharded",
group=group,
)
sharded_to_all_linear_in_place = partial(
shard_inplace,
sharding="sharded-to-all",
sharding=_all_to_sharded, # type: ignore
group=group,
)
if isinstance(model, LlamaModel):
N = group.size()
def _sharded_to_all(path: str, weight: mx.array):
if path.endswith("bias"):
logger.info(f"Sharding bias for {path} - sharded to all")
weight /= N
return None
return -1, SEGMENTS
sharded_to_all_linear_in_place = partial(
shard_inplace,
sharding=_sharded_to_all, # type: ignore
group=group,
)
if hasattr(model, "shard"):
try:
model.shard(group) # type: ignore
return model
except (AttributeError, TypeError, NameError):
pass
if isinstance(model, (LlamaModel, Ministral3Model)):
logger.warning("shouldn't be hit - upstream sharding exists")
tensor_parallel_sharding_strategy = LlamaShardingStrategy(
group,
all_to_sharded_linear,
@@ -223,7 +244,8 @@ def tensor_auto_parallel(
all_to_sharded_linear_in_place,
sharded_to_all_linear_in_place,
)
elif isinstance(model, DeepseekV3Model):
elif isinstance(model, (DeepseekV3Model, DeepseekV32Model)):
logger.warning("shouldn't be hit - upstream sharding exists")
tensor_parallel_sharding_strategy = DeepSeekShardingStrategy(
group,
all_to_sharded_linear,
@@ -231,7 +253,7 @@ def tensor_auto_parallel(
all_to_sharded_linear_in_place,
sharded_to_all_linear_in_place,
)
elif isinstance(model, Qwen3MoeModel):
elif isinstance(model, (Qwen3MoeModel, Glm4MoeModel, Qwen3NextModel)):
tensor_parallel_sharding_strategy = QwenShardingStrategy(
group,
all_to_sharded_linear,
@@ -239,6 +261,15 @@ def tensor_auto_parallel(
all_to_sharded_linear_in_place,
sharded_to_all_linear_in_place,
)
elif isinstance(model, GptOssModel):
tensor_parallel_sharding_strategy = GptOssShardingStrategy(
group,
all_to_sharded_linear,
sharded_to_all_linear,
all_to_sharded_linear_in_place,
sharded_to_all_linear_in_place,
)
else:
raise ValueError(f"Unsupported model type: {type(model)}")
@@ -284,13 +315,38 @@ class LlamaShardingStrategy(TensorParallelShardingStrategy):
return model
def _set_layers(model: nn.Module, layers: list[_LayerCallable]) -> None:
inner_model_instance = _inner_model(model)
if hasattr(inner_model_instance, "layers"):
inner_model_instance.layers = layers
# Update DeepSeek V3 specific parameters when layers are shrunk
if isinstance(model, (DeepseekV3Model, DeepseekV32Model, Glm4MoeModel)) and hasattr(
inner_model_instance, "num_layers"
):
logger.info(
f"Setting num_layers to {len(layers)} for model {model.model.__class__.__name__}"
)
inner_model_instance.start_idx = 0
inner_model_instance.end_idx = len(layers)
inner_model_instance.num_layers = len(layers)
elif isinstance(model, Qwen3MoeModel):
logger.info(
f"Setting num_hidden_layers to {len(layers)} for model {model.model.__class__.__name__}"
)
inner_model_instance.num_hidden_layers = len(layers)
elif hasattr(inner_model_instance, "h"):
inner_model_instance.h = layers
else:
raise ValueError("Model must have either a 'layers' or 'h' attribute")
class DeepSeekShardingStrategy(TensorParallelShardingStrategy):
def shard_model(self, model: nn.Module) -> nn.Module:
model = cast(DeepseekV3Model, model)
for layer in model.layers:
# Shard the self attention
if layer.self_attn.q_lora_rank is None: # pyright: ignore[reportUnnecessaryComparison]
# Unfortunately, q_lora_rank can be None despite typing hints.
if layer.self_attn.q_lora_rank is None:
layer.self_attn.q_proj = self.all_to_sharded_linear(
layer.self_attn.q_proj
)
@@ -305,7 +361,7 @@ class DeepSeekShardingStrategy(TensorParallelShardingStrategy):
layer.self_attn.num_heads //= self.N
# Shard the MLP
if isinstance(layer.mlp, DeepseekV3MLP):
if isinstance(layer.mlp, (DeepseekV3MLP, DeepseekV32MLP)):
layer.mlp.gate_proj = self.all_to_sharded_linear(layer.mlp.gate_proj)
layer.mlp.down_proj = self.sharded_to_all_linear(layer.mlp.down_proj)
layer.mlp.up_proj = self.all_to_sharded_linear(layer.mlp.up_proj)
@@ -353,7 +409,7 @@ class QwenShardingStrategy(TensorParallelShardingStrategy):
# Shard the MoE. Shard in place since the MoE should be responsible
# for aggregating the results.
if isinstance(layer.mlp, Qwen3MoeSparseMoeBlock):
if isinstance(layer.mlp, (Qwen3MoeSparseMoeBlock, MoE, Qwen3NextSparseMoeBlock)):
self.all_to_sharded_linear_in_place(layer.mlp.switch_mlp.gate_proj)
self.sharded_to_all_linear_in_place(layer.mlp.switch_mlp.down_proj)
self.all_to_sharded_linear_in_place(layer.mlp.switch_mlp.up_proj)
@@ -381,3 +437,52 @@ class ShardedQwenMoE(CustomMlxLayer):
if self.sharding_group is not None:
y = mx.distributed.all_sum(y, group=self.sharding_group)
return y
class GptOssShardingStrategy(TensorParallelShardingStrategy):
def shard_model(self, model: nn.Module) -> nn.Module:
model = cast(GptOssMoeModel, model)
for layer in model.layers:
layer.self_attn.q_proj = self.all_to_sharded_linear(layer.self_attn.q_proj)
layer.self_attn.k_proj = self.all_to_sharded_linear(layer.self_attn.k_proj)
layer.self_attn.v_proj = self.all_to_sharded_linear(layer.self_attn.v_proj)
layer.self_attn.o_proj = self.sharded_to_all_linear(layer.self_attn.o_proj)
layer.self_attn.num_attention_heads //= self.N
layer.self_attn.num_key_value_heads //= self.N
layer.self_attn.num_key_value_groups = (
layer.self_attn.num_attention_heads
// layer.self_attn.num_key_value_heads
)
layer.self_attn.sinks = layer.self_attn.sinks[
layer.self_attn.num_attention_heads
* self.group.rank() : layer.self_attn.num_attention_heads
* (self.group.rank() + 1)
]
self.all_to_sharded_linear_in_place(layer.mlp.experts.gate_proj)
self.sharded_to_all_linear_in_place(layer.mlp.experts.down_proj)
self.all_to_sharded_linear_in_place(layer.mlp.experts.up_proj)
layer.mlp = ShardedGptOssMoE(layer.mlp) # type: ignore
layer.mlp.sharding_group = self.group
return model
class ShardedGptOssMoE(CustomMlxLayer):
def __init__(self, layer: nn.Module):
super().__init__(layer)
self.sharding_group: mx.distributed.Group | None = None
def __call__(self, x: mx.array) -> mx.array:
if self.sharding_group is not None:
x = sum_gradients(self.sharding_group)(x)
y = self.original_layer(x)
if self.sharding_group is not None:
y = mx.distributed.all_sum(y, group=self.sharding_group)
return y

View File

@@ -1,10 +1,23 @@
import json
import os
import resource
import sys
import time
from pathlib import Path
from typing import Any, cast
# Monkey-patch for transformers 5.x compatibility
# Kimi's tokenization_kimi.py imports bytes_to_unicode from the old location
# which was moved in transformers 5.0.0rc2
try:
import transformers.models.gpt2.tokenization_gpt2 as gpt2_tokenization
from transformers.convert_slow_tokenizer import bytes_to_unicode
if not hasattr(gpt2_tokenization, "bytes_to_unicode"):
gpt2_tokenization.bytes_to_unicode = bytes_to_unicode # type: ignore[attr-defined]
except ImportError:
pass # transformers < 5.0 or bytes_to_unicode not available
from mlx_lm.models.cache import KVCache, QuantizedKVCache, RotatingKVCache
from mlx_lm.models.deepseek_v3 import DeepseekV3Model
from mlx_lm.tokenizer_utils import TokenizerWrapper
@@ -18,7 +31,7 @@ from exo.worker.engines.mlx.constants import (
try:
from mlx_lm.tokenizer_utils import load_tokenizer
except ImportError:
from mlx_lm.tokenizer_utils import load as load_tokenizer # type: ignore
from mlx_lm.tokenizer_utils import load as load_tokenizer
import contextlib
import mlx.core as mx
@@ -252,26 +265,70 @@ def shard_and_load(
return model, tokenizer
def get_tokenizer(model_path: Path, shard_metadata: ShardMetadata):
# TODO: Let's move away from this custom logic to mlx_lm.load()
if "kimi-k2" in shard_metadata.model_meta.model_id.lower():
eos_token_ids = [163586]
def get_tokenizer(model_path: Path, shard_metadata: ShardMetadata) -> TokenizerWrapper:
"""Load tokenizer for a model shard. Delegates to load_tokenizer_for_model_id."""
return load_tokenizer_for_model_id(shard_metadata.model_meta.model_id, model_path)
elif "glm" in shard_metadata.model_meta.model_id.lower():
eos_token_ids = [151336, 151329, 151338]
else:
eos_token_ids = None
def get_eos_token_ids_for_model(model_id: str) -> list[int] | None:
"""
Get the EOS token IDs for a model based on its ID.
tokenizer = cast(
TokenizerWrapper,
load_tokenizer(
model_path,
tokenizer_config_extra={"trust_remote_code": TRUST_REMOTE_CODE},
eos_token_ids=eos_token_ids,
),
Some models require explicit EOS token configuration that isn't in their
tokenizer config. This function returns the known EOS token IDs for such models.
Args:
model_id: The HuggingFace model ID
Returns:
List of EOS token IDs, or None if the model uses standard tokenizer config
"""
model_id_lower = model_id.lower()
if "kimi-k2" in model_id_lower:
return [163586]
elif "glm" in model_id_lower:
return [151336, 151329, 151338]
return None
def load_tokenizer_for_model_id(model_id: str, model_path: Path) -> TokenizerWrapper:
"""
Load tokenizer for a model given its ID and local path.
This is the core tokenizer loading logic, handling special cases for different
model families (Kimi, GLM, etc.) and transformers 5.x compatibility.
Args:
model_id: The HuggingFace model ID (e.g., "moonshotai/Kimi-K2-Instruct")
model_path: Local path where the model/tokenizer files are stored
Returns:
TokenizerWrapper instance configured for the model
"""
model_id_lower = model_id.lower()
eos_token_ids = get_eos_token_ids_for_model(model_id)
# Kimi uses a custom TikTokenTokenizer that transformers 5.x can't load via AutoTokenizer
if "kimi-k2" in model_id_lower:
sys.path.insert(0, str(model_path))
from tokenization_kimi import TikTokenTokenizer # type: ignore[import-not-found] # noqa: I001
hf_tokenizer: Any = TikTokenTokenizer.from_pretrained(model_path) # pyright: ignore[reportUnknownVariableType,reportUnknownMemberType]
# Patch encode to use internal tiktoken model directly
# transformers 5.x has a bug in the encode->pad path for slow tokenizers
def _patched_encode(text: str, **_kwargs: object) -> list[int]:
# Pass allowed_special="all" to handle special tokens like <|im_user|>
return list(hf_tokenizer.model.encode(text, allowed_special="all")) # pyright: ignore[reportUnknownMemberType,reportUnknownArgumentType]
hf_tokenizer.encode = _patched_encode
return TokenizerWrapper(hf_tokenizer, eos_token_ids=eos_token_ids)
tokenizer = load_tokenizer(
model_path,
tokenizer_config_extra={"trust_remote_code": TRUST_REMOTE_CODE},
eos_token_ids=eos_token_ids,
)
assert isinstance(tokenizer, TokenizerWrapper)
return tokenizer
@@ -301,14 +358,14 @@ def apply_chat_template(
{k: v for k, v in message.model_dump().items() if v is not None} # type: ignore
)
prompt: str = tokenizer.apply_chat_template( # type: ignore
prompt: str = tokenizer.apply_chat_template(
formatted_messages,
tokenize=False,
add_generation_prompt=True,
tools=chat_task_data.tools,
)
return prompt # type: ignore
return prompt
class NullKVCache(KVCache):

View File

@@ -217,7 +217,9 @@ class Worker:
)
if initial_progress.status == "complete":
progress = DownloadCompleted(
shard_metadata=shard, node_id=self.node_id
shard_metadata=shard,
node_id=self.node_id,
total_bytes=initial_progress.total_bytes,
)
self.download_status[shard.model_meta.model_id] = progress
await self.event_sender.send(
@@ -364,7 +366,11 @@ class Worker:
nonlocal self
nonlocal last_progress_time
if progress.status == "complete":
status = DownloadCompleted(shard_metadata=shard, node_id=self.node_id)
status = DownloadCompleted(
shard_metadata=shard,
node_id=self.node_id,
total_bytes=progress.total_bytes,
)
self.download_status[shard.model_meta.model_id] = status
# Footgun!
self.event_sender.send_nowait(
@@ -457,7 +463,9 @@ class Worker:
) in self.shard_downloader.get_shard_download_status():
if progress.status == "complete":
status = DownloadCompleted(
node_id=self.node_id, shard_metadata=progress.shard
node_id=self.node_id,
shard_metadata=progress.shard,
total_bytes=progress.total_bytes,
)
elif progress.status in ["in_progress", "not_started"]:
if progress.downloaded_bytes_this_session.in_bytes == 0:

View File

@@ -0,0 +1,386 @@
"""
Unit tests for tokenizer loading and functionality across all supported models.
This test downloads only tokenizer-related files (not full model weights) to verify
that tokenizers can be loaded and used correctly for encoding/decoding.
"""
import asyncio
import contextlib
from pathlib import Path
import pytest
from exo.shared.models.model_cards import MODEL_CARDS, ModelCard
from exo.worker.download.download_utils import (
download_file_with_retry,
ensure_models_dir,
fetch_file_list_with_cache,
)
from exo.worker.engines.mlx.utils_mlx import (
get_eos_token_ids_for_model,
load_tokenizer_for_model_id,
)
# Files needed for tokenizer functionality
TOKENIZER_FILE_PATTERNS = [
"tokenizer.json",
"tokenizer_config.json",
"special_tokens_map.json",
"vocab.json",
"vocab.txt",
"merges.txt",
"tiktoken.model",
"added_tokens.json",
"tokenizer.model",
"tokenization_*.py", # Custom tokenizer implementations
]
def is_tokenizer_file(filename: str) -> bool:
"""Check if a file is needed for tokenizer functionality."""
for pattern in TOKENIZER_FILE_PATTERNS:
if "*" in pattern:
prefix = pattern.split("*")[0]
suffix = pattern.split("*")[1]
if filename.startswith(prefix) and filename.endswith(suffix):
return True
elif filename == pattern:
return True
return False
async def download_tokenizer_files(model_id: str) -> Path:
"""Download only the tokenizer-related files for a model."""
target_dir = await ensure_models_dir() / model_id.replace("/", "--")
target_dir.mkdir(parents=True, exist_ok=True)
file_list = await fetch_file_list_with_cache(model_id, "main", recursive=True)
tokenizer_files = [f for f in file_list if is_tokenizer_file(f.path)]
if not tokenizer_files:
pytest.skip(f"No tokenizer files found for {model_id}")
for file_entry in tokenizer_files:
with contextlib.suppress(FileNotFoundError):
await download_file_with_retry(
model_id, "main", file_entry.path, target_dir
)
return target_dir
# Get a sample of models to test (one per family to keep tests fast)
def get_test_models() -> list[tuple[str, ModelCard]]:
"""Get a representative sample of models to test."""
# Pick one model from each family to test
families: dict[str, tuple[str, ModelCard]] = {}
for short_id, card in MODEL_CARDS.items():
# Extract family name (e.g., "llama-3.1" from "llama-3.1-8b")
parts = short_id.split("-")
family = "-".join(parts[:2]) if len(parts) >= 2 else parts[0]
if family not in families:
families[family] = (short_id, card)
return list(families.values())
TEST_MODELS: list[tuple[str, ModelCard]] = get_test_models()
@pytest.fixture(scope="module")
def event_loop():
"""Create event loop for async tests."""
loop = asyncio.new_event_loop()
yield loop
loop.close()
@pytest.mark.parametrize(
"short_id,model_card",
TEST_MODELS,
ids=[m[0] for m in TEST_MODELS],
)
@pytest.mark.asyncio
async def test_tokenizer_encode_decode(short_id: str, model_card: ModelCard) -> None:
"""Test that tokenizer can encode and decode text correctly."""
model_id = str(model_card.model_id)
# Download tokenizer files
model_path = await download_tokenizer_files(model_id)
# Verify required files exist
has_tokenizer = (
(model_path / "tokenizer.json").exists()
or (model_path / "tokenizer_config.json").exists()
or (model_path / "tiktoken.model").exists()
or (model_path / "tokenizer.model").exists()
)
if not has_tokenizer:
pytest.skip(f"Required tokenizer files not found for {model_id}")
# Load tokenizer
tokenizer = load_tokenizer_for_model_id(model_id, model_path)
# Test basic encoding
test_text = "Hello, world!"
encoded = tokenizer.encode(test_text)
assert isinstance(encoded, list), f"encode() should return a list for {model_id}"
assert len(encoded) > 0, f"encode() should return non-empty list for {model_id}"
assert all(isinstance(t, int) for t in encoded), (
f"All tokens should be integers for {model_id}"
)
# Test decoding
decoded = tokenizer.decode(encoded)
assert isinstance(decoded, str), f"decode() should return a string for {model_id}"
assert test_text in decoded or decoded.strip() == test_text.strip(), (
f"decode(encode(x)) should preserve text for {model_id}: got {decoded!r}"
)
# Test with longer text
long_text = "The quick brown fox jumps over the lazy dog. " * 10
long_encoded = tokenizer.encode(long_text)
assert len(long_encoded) > len(encoded), (
f"Longer text should produce more tokens for {model_id}"
)
# Test empty string
empty_encoded = tokenizer.encode("")
assert isinstance(empty_encoded, list), (
f"encode('') should return a list for {model_id}"
)
# Test special characters
special_text = 'Hello!\n\tWorld? <test> & "quotes"'
special_encoded = tokenizer.encode(special_text)
assert len(special_encoded) > 0, f"Special chars should encode for {model_id}"
# Test unicode
unicode_text = "Hello 世界 🌍"
unicode_encoded = tokenizer.encode(unicode_text)
assert len(unicode_encoded) > 0, f"Unicode should encode for {model_id}"
@pytest.mark.parametrize(
"short_id,model_card",
TEST_MODELS,
ids=[m[0] for m in TEST_MODELS],
)
@pytest.mark.asyncio
async def test_tokenizer_has_required_attributes(
short_id: str, model_card: ModelCard
) -> None:
"""Test that tokenizer has required attributes for inference."""
model_id = str(model_card.model_id)
model_path = await download_tokenizer_files(model_id)
has_tokenizer = (
(model_path / "tokenizer.json").exists()
or (model_path / "tokenizer_config.json").exists()
or (model_path / "tiktoken.model").exists()
or (model_path / "tokenizer.model").exists()
)
if not has_tokenizer:
pytest.skip(f"Required tokenizer files not found for {model_id}")
tokenizer = load_tokenizer_for_model_id(model_id, model_path)
eos_token_ids = get_eos_token_ids_for_model(model_id)
# Check for vocabulary size
empty_vocab: dict[str, int] = {}
vocab_size: int = getattr(tokenizer, "vocab_size", None) or len(
getattr(tokenizer, "get_vocab", lambda: empty_vocab)()
)
assert vocab_size > 0, f"Tokenizer should have vocab_size > 0 for {model_id}"
# Check for EOS token (either from tokenizer or explicitly provided)
has_eos = (
eos_token_ids is not None
or getattr(tokenizer, "eos_token_id", None) is not None
or getattr(tokenizer, "eos_token", None) is not None
)
assert has_eos, f"Tokenizer should have EOS token for {model_id}"
@pytest.mark.parametrize(
"short_id,model_card",
TEST_MODELS,
ids=[m[0] for m in TEST_MODELS],
)
@pytest.mark.asyncio
async def test_tokenizer_special_tokens(short_id: str, model_card: ModelCard) -> None:
"""Test that tokenizer can encode text containing special tokens.
This is critical because the actual inference path uses prompts with
special tokens from chat templates. If special tokens aren't handled
correctly, encoding will fail.
"""
model_id = str(model_card.model_id)
model_path = await download_tokenizer_files(model_id)
has_tokenizer = (
(model_path / "tokenizer.json").exists()
or (model_path / "tokenizer_config.json").exists()
or (model_path / "tiktoken.model").exists()
or (model_path / "tokenizer.model").exists()
)
assert has_tokenizer, f"Required tokenizer files not found for {model_id}"
tokenizer = load_tokenizer_for_model_id(model_id, model_path)
# Get special tokens from the tokenizer
special_tokens: list[str] = []
# Try to get special tokens from various sources
if hasattr(tokenizer, "all_special_tokens"):
special_tokens.extend(tokenizer.all_special_tokens)
elif hasattr(tokenizer, "_tokenizer") and hasattr(
tokenizer._tokenizer,
"all_special_tokens",
):
special_tokens.extend(tokenizer._tokenizer.all_special_tokens)
# Also check for common special token attributes
for attr in [
"bos_token",
"eos_token",
"pad_token",
"unk_token",
"sep_token",
"cls_token",
]:
token = getattr(tokenizer, attr, None)
if token is None and hasattr(tokenizer, "_tokenizer"):
token = getattr(tokenizer._tokenizer, attr, None)
if token and isinstance(token, str) and token not in special_tokens:
special_tokens.append(token)
# If we found special tokens, test encoding text that contains them
if special_tokens:
# Create text with special tokens interspersed
test_with_special = f"{special_tokens[0]}Hello world"
if len(special_tokens) > 1:
test_with_special += f"{special_tokens[1]}"
encoded = tokenizer.encode(test_with_special)
assert isinstance(encoded, list), (
f"encode() with special tokens should return list for {model_id}"
)
assert len(encoded) > 0, (
f"encode() with special tokens should return non-empty list for {model_id}"
)
assert all(isinstance(t, int) for t in encoded), (
f"All tokens should be integers for {model_id}"
)
# Verify we can decode
decoded = tokenizer.decode(encoded)
assert isinstance(decoded, str), f"decode() should return string for {model_id}"
# Test with angle-bracket tokens (common format for special tokens)
# These should not raise errors even if they're not actual special tokens
angle_bracket_text = "<|test|>Hello<|end|>"
encoded = tokenizer.encode(angle_bracket_text)
assert isinstance(encoded, list), (
f"encode() with angle brackets should return list for {model_id}"
)
assert len(encoded) > 0, (
f"encode() with angle brackets should be non-empty for {model_id}"
)

# Specifically test Kimi tokenizer since it has special handling
@pytest.mark.asyncio
async def test_kimi_tokenizer_specifically():
    """Test Kimi tokenizer with its specific patches and quirks."""
    kimi_models = [
        (short_id, card)
        for short_id, card in MODEL_CARDS.items()
        if "kimi" in short_id.lower()
    ]
    if not kimi_models:
        pytest.skip("No Kimi models found in MODEL_CARDS")

    _, model_card = kimi_models[0]
    model_id = str(model_card.model_id)
    model_path = await download_tokenizer_files(model_id)

    # Ensure the custom tokenizer file exists
    if not (model_path / "tokenization_kimi.py").exists():
        pytest.skip("tokenization_kimi.py not found")

    tokenizer = load_tokenizer_for_model_id(model_id, model_path)
    eos_token_ids = get_eos_token_ids_for_model(model_id)

    # Test encode/decode cycle
    test_text = "Hello, world!"
    encoded = tokenizer.encode(test_text)
    decoded = tokenizer.decode(encoded)
    assert len(encoded) > 0, "Kimi tokenizer should encode text"
    assert isinstance(decoded, str), "Kimi tokenizer should decode to string"
    # Test that the patched encode works (returns list of ints)
    assert all(isinstance(t, int) for t in encoded), "Tokens should be integers"

    # Test encoding text with special tokens (like from chat templates)
    # This is critical - the warmup inference uses prompts with special tokens
    special_token_text = "<|im_user|>user<|im_middle|>Hello<|im_end|><|im_assistant|>"
    special_encoded = tokenizer.encode(special_token_text)
    assert len(special_encoded) > 0, "Kimi tokenizer should handle special tokens"
    assert all(isinstance(t, int) for t in special_encoded), (
        "Special token encoding should return integers"
    )

    # Verify EOS token is set
    assert eos_token_ids == [163586], "Kimi EOS token should be [163586]"
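
# For context (a sketch, not part of the test file): the hard-coded prompt
# above mirrors what a chat template produces. The <|im_user|>/<|im_assistant|>
# markers are taken from the string asserted in the test; the exact output of
# Kimi's template is otherwise an assumption here.
def render_and_encode(tokenizer) -> list[int]:
    messages = [{"role": "user", "content": "Hello"}]
    # tokenize=False returns the templated string with role markers embedded,
    # which is why encode() must accept special tokens instead of rejecting them.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return tokenizer.encode(prompt)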

# Test GLM tokenizer since it also has special handling
@pytest.mark.asyncio
async def test_glm_tokenizer_specifically():
    """Test GLM tokenizer with its specific EOS tokens."""
    glm_models = [
        (short_id, card)
        for short_id, card in MODEL_CARDS.items()
        if "glm" in short_id.lower()
    ]
    if not glm_models:
        pytest.skip("No GLM models found in MODEL_CARDS")

    _, model_card = glm_models[0]
    model_id = str(model_card.model_id)
    model_path = await download_tokenizer_files(model_id)
    has_tokenizer = (model_path / "tokenizer.json").exists() or (
        model_path / "tokenizer_config.json"
    ).exists()
    if not has_tokenizer:
        pytest.skip("GLM tokenizer files not found")

    tokenizer = load_tokenizer_for_model_id(model_id, model_path)
    eos_token_ids = get_eos_token_ids_for_model(model_id)

    # Test encode/decode
    test_text = "Hello, world!"
    encoded = tokenizer.encode(test_text)
    decoded = tokenizer.decode(encoded)
    assert len(encoded) > 0, "GLM tokenizer should encode text"
    assert isinstance(decoded, str), "GLM tokenizer should decode to string"

    # Verify EOS tokens
    assert eos_token_ids == [
        151336,
        151329,
        151338,
    ], "GLM EOS tokens should be correct"


@@ -1,5 +1,6 @@
 import exo.worker.plan as plan_mod
 from exo.shared.types.common import NodeId
+from exo.shared.types.memory import Memory
 from exo.shared.types.models import ModelId
 from exo.shared.types.tasks import LoadModel
 from exo.shared.types.worker.downloads import DownloadCompleted, DownloadProgress
@@ -94,13 +95,23 @@ def test_plan_loads_model_when_all_shards_downloaded_and_waiting():
     # Local node has already marked its shard as downloaded (not actually used by _load_model)
     local_download_status = {
-        MODEL_A_ID: DownloadCompleted(shard_metadata=shard1, node_id=NODE_A)
+        MODEL_A_ID: DownloadCompleted(
+            shard_metadata=shard1, node_id=NODE_A, total_bytes=Memory()
+        )
     }
     # Global view has completed downloads for both nodes
     global_download_status = {
-        NODE_A: [DownloadCompleted(shard_metadata=shard1, node_id=NODE_A)],
-        NODE_B: [DownloadCompleted(shard_metadata=shard2, node_id=NODE_B)],
+        NODE_A: [
+            DownloadCompleted(
+                shard_metadata=shard1, node_id=NODE_A, total_bytes=Memory()
+            )
+        ],
+        NODE_B: [
+            DownloadCompleted(
+                shard_metadata=shard2, node_id=NODE_B, total_bytes=Memory()
+            )
+        ],
     }
     result = plan_mod.plan(
@@ -140,7 +151,9 @@ def test_plan_does_not_request_download_when_shard_already_downloaded():
     # Local status claims the shard is downloaded already
     local_download_status = {
-        MODEL_A_ID: DownloadCompleted(shard_metadata=shard, node_id=NODE_A)
+        MODEL_A_ID: DownloadCompleted(
+            shard_metadata=shard, node_id=NODE_A, total_bytes=Memory()
+        )
     }
     # Global view hasn't caught up yet (no completed shards recorded for NODE_A)
@@ -192,10 +205,16 @@ def test_plan_does_not_load_model_until_all_shards_downloaded_globally():
     # Only NODE_A's shard is recorded as downloaded globally
     local_download_status = {
-        MODEL_A_ID: DownloadCompleted(shard_metadata=shard1, node_id=NODE_A)
+        MODEL_A_ID: DownloadCompleted(
+            shard_metadata=shard1, node_id=NODE_A, total_bytes=Memory()
+        )
     }
     global_download_status = {
-        NODE_A: [DownloadCompleted(shard_metadata=shard1, node_id=NODE_A)],
+        NODE_A: [
+            DownloadCompleted(
+                shard_metadata=shard1, node_id=NODE_A, total_bytes=Memory()
+            )
+        ],
         NODE_B: [],  # NODE_B has no downloads completed yet
     }
@@ -212,9 +231,15 @@ def test_plan_does_not_load_model_until_all_shards_downloaded_globally():
     assert result is None
     global_download_status = {
-        NODE_A: [DownloadCompleted(shard_metadata=shard1, node_id=NODE_A)],
+        NODE_A: [
+            DownloadCompleted(
+                shard_metadata=shard1, node_id=NODE_A, total_bytes=Memory()
+            )
+        ],
         NODE_B: [
-            DownloadCompleted(shard_metadata=shard2, node_id=NODE_B)
+            DownloadCompleted(
+                shard_metadata=shard2, node_id=NODE_B, total_bytes=Memory()
+            )
         ],  # NODE_B has no downloads completed yet
     }
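
Every fixture in this diff now builds the same three-argument DownloadCompleted. A small helper (hypothetical, not part of the change) would keep the fixtures terse; these tests exercise plan() decisions rather than sizes, so a default Memory() appears to be enough:

def completed(shard, node_id):
    # Default Memory() stands in for the real shard size in these fixtures.
    return DownloadCompleted(shard_metadata=shard, node_id=node_id, total_bytes=Memory())

local_download_status = {MODEL_A_ID: completed(shard1, NODE_A)}
global_download_status = {
    NODE_A: [completed(shard1, NODE_A)],
    NODE_B: [completed(shard2, NODE_B)],
}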


@@ -49,14 +49,12 @@ class Tests(BaseModel):
     kind: typing.Literal["init", "warmup", "inference"]
-hn = socket.gethostname()
 mp.set_start_method("spawn", force=True)
 logger_setup(None)
 async def main():
     logger.info("starting cool server majig")
-    logger.info(hn)
     await assert_downloads()
     cfg = Config()
     cfg.bind = "0.0.0.0:52415"
@@ -81,20 +79,35 @@ async def main():
 async def assert_downloads():
     sd = exo_shard_downloader()
     # await sd.ensure_shard(await build_full_shard(MODEL_CARDS["qwen3-0.6b"].model_id))
-    await sd.ensure_shard(await build_full_shard(MODEL_CARDS["llama-3.2-1b"].model_id))
+    await sd.ensure_shard(
+        await build_full_shard(MODEL_CARDS["llama-3.1-8b-bf16"].model_id)
+    )
+    await sd.ensure_shard(await build_full_shard(MODEL_CARDS["qwen3-30b"].model_id))
+    await sd.ensure_shard(
+        await build_full_shard(MODEL_CARDS["gpt-oss-120b-MXFP4-Q8"].model_id)
+    )
+    await sd.ensure_shard(
+        await build_full_shard(MODEL_CARDS["gpt-oss-20b-4bit"].model_id)
+    )
 async def ring_backend(test: Tests):
     iid = InstanceId(str(hash(str(test.devs))))
-    return await execute_test(test, ring_instance(test, iid))
+    weird_hn = socket.gethostname()
+    for dev in test.devs:
+        if weird_hn.startswith(dev[0]) or dev[0].startswith(weird_hn):
+            hn = dev[0]
+            break
+    else:
+        raise ValueError(f"{weird_hn} not in {test.devs}")
+    return await execute_test(test, ring_instance(test, iid, hn), hn)
-def ring_instance(test: Tests, iid: InstanceId) -> Instance:
-    global hn
+def ring_instance(test: Tests, iid: InstanceId, hn: str) -> Instance:
     hbn = [Host(ip="i dont care", port=52416) for _ in test.devs]
     world_size = len(test.devs)
     for i in range(world_size):
-        if hn.startswith(test.devs[i][0]):
+        if test.devs[i][0] == hn:
             hn = test.devs[i][0]
             if i - 1 >= 0:
                 hbn[i - 1] = Host(ip=test.devs[i - 1][1], port=52416)
@@ -102,6 +115,8 @@ def ring_instance(test: Tests, iid: InstanceId) -> Instance:
                 hbn[i + 1] = Host(ip=test.devs[i + 1][1], port=52416)
             hbn[i] = Host(ip="0.0.0.0", port=52416)
             break
+    else:
+        raise ValueError(f"{hn} not in {test.devs}")
     meta = MODEL_CARDS[test.model_id].metadata
     instance = MlxRingInstance(
@@ -131,10 +146,10 @@ def ring_instance(test: Tests, iid: InstanceId) -> Instance:
     return instance
-async def execute_test(test: Tests, instance: Instance):
+async def execute_test(test: Tests, instance: Instance, hn: str):
     world_size = len(test.devs)
     iid = InstanceId(str(hash(str(test.devs))))
-    _handle, recv, send = new_runner(instance)
+    _handle, recv, send = new_runner(instance, hn)
     if world_size > 1:
         send.send(ConnectToGroup(instance_id=iid))
     send.send(LoadModel(instance_id=iid))
@@ -181,17 +196,19 @@ async def execute_test(test: Tests, instance: Instance):
 async def jaccl_backend(test: Tests):
     iid = InstanceId(str(hash(str(test.devs))))
-    return await execute_test(test, jaccl_instance(test, iid))
+    weird_hn = socket.gethostname()
+    for dev in test.devs:
+        if weird_hn.startswith(dev[0]) or dev[0].startswith(weird_hn):
+            hn = dev[0]
+            break
+    else:
+        raise ValueError(f"{weird_hn} not in {test.devs}")
+    return await execute_test(test, jaccl_instance(test, iid, hn), hn)
-def jaccl_instance(test: Tests, iid: InstanceId):
-    global hn
+def jaccl_instance(test: Tests, iid: InstanceId, hn: str):
     meta = MODEL_CARDS[test.model_id].metadata
     world_size = len(test.devs)
-    for name, _ in test.devs:
-        if hn.startswith(name):
-            hn = name
-            break
     return MlxJacclInstance(
         instance_id=iid,
@@ -220,6 +237,7 @@ def jaccl_instance(test: Tests, iid: InstanceId):
 def new_runner(
     instance: Instance,
+    hn: str,
 ) -> tuple[mp.Process, MpReceiver[Event], MpSender[Task]]:
     bound_instance = BoundInstance(
         instance=instance, bound_runner_id=RunnerId(hn), bound_node_id=NodeId(hn)

@@ -34,19 +34,23 @@ done
 devs_raw=$(printf "[\"%s\", \"%s\"], " "${weaved[@]}")
 devs="[${devs_raw%, }]"
-for i in "${!ips[@]}"; do
-  {
-    req="{
-      \"model_id\": \"llama-3.2-1b\",
-      \"devs\": ${devs},
-      \"kind\": \"inference\"
-    }"
-    echo "req $req"
-    curl -sN \
-      -X POST "http://${ips[$i]}:52415/${kind}" \
-      -H "Content-Type: application/json" -d "$req" \
-      2>&1 | sed "s/^/\n${hostnames[$i]}@${ips[$i]}: /" || echo "curl to ${hostnames[$i]} failed"
-  } &
+model_ids=("qwen3-30b" "gpt-oss-120b-MXFP4-Q8" "kimi-k2-thinking")
+for model_id in "${model_ids[@]}"; do
+  for i in "${!ips[@]}"; do
+    {
+      req="{
+        \"model_id\": \"${model_id}\",
+        \"devs\": ${devs},
+        \"kind\": \"inference\"
+      }"
+      echo "req $req"
+      curl -sN \
+        -X POST "http://${ips[$i]}:52415/${kind}" \
+        -H "Content-Type: application/json" -d "$req" \
+        2>&1 | sed "s/^/\n${hostnames[$i]}@${ips[$i]}: /" || echo "curl to ${hostnames[$i]} failed" && exit 1
+    } &
+  done
+  wait
 done
 wait
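
Each iteration of the loop above boils down to one JSON POST per node. A hedged Python equivalent of a single request, with placeholder host, devices, and endpoint (the script's ${kind} path segment is set elsewhere), might look like:

import json
import urllib.request

payload = {
    "model_id": "qwen3-30b",  # one of the model_ids iterated above
    "devs": [["node-a", "10.0.0.1"], ["node-b", "10.0.0.2"]],  # placeholder devices
    "kind": "inference",
}
req = urllib.request.Request(
    "http://10.0.0.1:52415/inference",  # stands in for http://${ips[$i]}:52415/${kind}
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())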

uv.lock (generated): 1580 changed lines. File diff suppressed because it is too large.