removed animation

fix: resolve issue #1025
2026-01-18 02:50:24 -05:00 · 2026-01-06 07:07:03 +05:00 · 2026-01-02 10:00:52 +05:00
110 changed files with 6989 additions and 9189 deletions
--- a/.github/benchmark-dashboard/README.md
+++ b/.github/benchmark-dashboard/README.md
@@ -0,0 +1,159 @@
+# EXO Benchmark Dashboard
+
+A fully self-contained, browser-based dashboard for tracking EXO benchmark performance over time.
+
+## Features
+
+- 📊 **Success Rate Tracking**: Monitor cluster reliability across commits
+- ⚡ **Response Time Analysis**: Track average request completion times
+- 🎯 **Throughput Metrics**: Tokens-per-second visualization
+- 📈 **Request Distribution**: Success/failure breakdown over time
+- 🔄 **Auto-Refresh**: Updates every 60 seconds
+- 📺 **TV-Ready**: Large, clear visualizations suited to wall displays
+- 🔐 **Client-Side Credentials**: AWS keys are stored only in browser `localStorage`
+- 🌐 **No Backend**: Reads S3 directly from the browser
+
+## Quick Start
+
+### Option 1: Direct File Access (Simplest)
+
+Just open the HTML file directly in your browser:
+
+```bash
+open .github/benchmark-dashboard/index.html
+```
+
+Then click "Configure AWS Credentials" and enter your keys.
+
+### Option 2: URL Parameters (For Quick Setup)
+
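+```bash
+# Assumes the dashboard is served locally as in Option 3.
+# The credentials are moved to localStorage when the page loads.
+open "http://localhost:8080/.github/benchmark-dashboard/index.html?accessKey=YOUR_KEY&secretKey=YOUR_SECRET&region=us-east-1"
+```
+
+The page saves the credentials to localStorage and removes them from the URL immediately. The command itself may still be recorded in your shell history.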
+
+### Option 3: Simple HTTP Server
+
+```bash
+# From repo root
+python3 -m http.server 8080
+
+# Then open: http://localhost:8080/.github/benchmark-dashboard/
+```
+
+## AWS Credentials
+
+The dashboard needs read-only access to the `exo-benchmark-results` S3 bucket.
+
+### Required IAM Permissions
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:GetObject",
+        "s3:ListBucket"
+      ],
+      "Resource": [
+        "arn:aws:s3:::exo-benchmark-results",
+        "arn:aws:s3:::exo-benchmark-results/*"
+      ]
+    }
+  ]
+}
+```
+
+### Security Notes
+
+- ✅ Credentials are stored only in browser `localStorage`
+- ✅ Credentials are never sent to any server other than AWS
+- ✅ All S3 access happens client-side
+- ⚠️ Use a dedicated IAM user with only the read-only policy above
+- ⚠️ Don't commit credentials to git
+- ⚠️ Anyone with access to the browser profile can read `localStorage`, so avoid shared machines
+
+## TV/Kiosk Mode
+
+For permanent display on a TV:
+
+### macOS
+```bash
+# From the repo root; Chrome needs an absolute file:// URL
+open -a "Google Chrome" --args --kiosk "file://$(pwd)/.github/benchmark-dashboard/index.html"
+```
+
+### Linux
+```bash
+chromium-browser --kiosk --app="file://$(pwd)/.github/benchmark-dashboard/index.html"
+```
+
+### Auto-start on Boot
+
+Create a simple startup script:
+
+```bash
+#!/bin/bash
+# /usr/local/bin/start-benchmark-dashboard.sh
+
+cd /path/to/exo
+python3 -m http.server 8080 &
+sleep 2
+chromium-browser --kiosk http://localhost:8080/.github/benchmark-dashboard/
+```
+
+## Data Displayed
+
+### Summary Cards
+- **Latest Success Rate**: Most recent benchmark success percentage with trend
+- **Avg Response Time**: Latest average response time in ms with trend
+- **Total Benchmarks**: Count of all benchmarks run
+- **Active Configurations**: Number of unique benchmark configs
+
+### Charts
+1. **Success Rate Over Time**: Line chart showing reliability trends
+2. **Average Response Time**: Performance over time (lower is better)
+3. **Throughput**: Tokens/second metric (higher is better)
+4. **Request Distribution**: Stacked bar chart of successes/failures
+
+## How It Works
+
+1. **Loads AWS SDK**: Uses AWS SDK for JavaScript (browser version)
+2. **Lists S3 Objects**: Fetches all files from `s3://exo-benchmark-results/bench/`
+3. **Downloads Results**: Fetches each JSON result file
+4. **Parses & Visualizes**: Uses Chart.js to create interactive charts
+5. **Auto-Refreshes**: Polls S3 every 60 seconds for new results
+
+## Customization
+
+To modify the dashboard:
+
+1. Edit `index.html` 
+2. Adjust `REFRESH_INTERVAL` for different polling frequency
+3. Modify chart colors/styles in the Chart.js configuration
+4. Add new metrics by extending the results parsing
+
+## Troubleshooting
+
+**"AWS credentials not configured"**
+- Click "Configure AWS Credentials" and enter your keys
+
+**"Error loading benchmark data"**
+- Check AWS credentials are correct
+- Verify S3 bucket name is `exo-benchmark-results`
+- Ensure IAM user has read permissions
+- Check browser console for detailed errors
+
+**"No benchmark results found"**
+- Wait for benchmark workflows to run
+- Verify results are being uploaded to S3
+- Check S3 bucket has files in `bench/` prefix
+
+**Charts not updating**
+- Check browser console for errors
+- Verify network connectivity to S3
+- Try refreshing the page manually
+
--- a/.github/benchmark-dashboard/index.html
+++ b/.github/benchmark-dashboard/index.html
--- a/.github/configs/README.md
+++ b/.github/configs/README.md
@@ -0,0 +1,186 @@
+# EXO Benchmark Configurations
+
+This directory contains configuration files for the EXO staged benchmark system.
+
+## Overview
+
+The staged benchmark system allows you to run complex, multi-stage load tests against EXO clusters. Each stage can have different characteristics:
+
+- **Prompt Length**: Number of tokens in the input prompt
+- **Generation Length**: Maximum tokens to generate in the response
+- **Time Between Requests**: Delay (in seconds) between firing consecutive requests
+- **Iterations**: Number of requests to send in this stage
+
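+By default, requests are **fire-and-forget**: each one is sent without waiting for the previous request to complete. This lets you test overlapping request handling and measure success rates under load. Setting `no_overlap: true` in the config runs the requests sequentially instead.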
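+
+The two scheduling modes can be sketched with `asyncio`. This is a minimal illustration, not the actual `bench.py` code. `send_request` is a hypothetical stand-in that sleeps instead of calling the EXO API. The sketch also assumes that sequential mode still waits `time_between_requests` between requests.
+
+```python
+import asyncio
+import time
+
+
+async def send_request(duration_s: float) -> float:
+    """Hypothetical stand-in for one chat completion; returns its latency."""
+    start = time.monotonic()
+    await asyncio.sleep(duration_s)  # pretend the server takes this long
+    return time.monotonic() - start
+
+
+async def run_stage(
+    iterations: int,
+    time_between_requests: float,
+    request_duration_s: float,
+    no_overlap: bool = False,
+) -> list[float]:
+    """Send `iterations` requests spaced `time_between_requests` seconds apart."""
+    if no_overlap:
+        # Sequential: each request must finish before the next delay starts.
+        latencies: list[float] = []
+        for i in range(iterations):
+            latencies.append(await send_request(request_duration_s))
+            if i < iterations - 1:
+                await asyncio.sleep(time_between_requests)
+        return latencies
+
+    # Fire-and-forget: schedule each request as a task and move on immediately,
+    # so slow requests overlap with the ones fired after them.
+    tasks: list[asyncio.Task[float]] = []
+    for i in range(iterations):
+        tasks.append(asyncio.create_task(send_request(request_duration_s)))
+        if i < iterations - 1:
+            await asyncio.sleep(time_between_requests)
+    # Results are only collected once every request has been fired.
+    return list(await asyncio.gather(*tasks))
+```
+
+Take 3 requests that each take 0.2 s, fired 0.05 s apart. The fire-and-forget run finishes in roughly 0.3 s, while the sequential run takes roughly 0.7 s.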
+
+## Configuration Files
+
+### `bench_simple.yaml`
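+A minimal configuration for quick smoke tests. It currently defines a single active stage (`pp256_g64`):
+- 5 iterations, run sequentially (`no_overlap: true`)
+- 256-token prompt
+- Generates up to 64 tokens
+
+The other `ppN_gM` stages in the file (N prompt tokens, M generated tokens) are commented out and can be re-enabled as needed.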
+
+### `bench_config.yaml`
+A comprehensive multi-stage benchmark with:
+1. **Warmup** (10 requests): Light load with short prompts
+2. **Medium Load** (20 requests): Moderate load with medium prompts
+3. **Stress Test** (30 requests): Heavy overlapping requests with long prompts
+4. **Cooldown** (5 requests): Light load to wind down
+
+This tests the cluster's behavior under varying load patterns.
+
+## Configuration Schema
+
+```yaml
+# Hardware configuration - maps runner labels to instance counts
+hardware_plan:
+  M3ULTRA_GPU80_512GB: 4
+
+# Environment variables to set on each node (optional)
+environment:
+  OVERRIDE_MEMORY_MB: 512
+
+# Timeout for instance and runner readiness (seconds)
+timeout_seconds: 600
+
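+# Model instances to run concurrently
+model_ids:
+  - "mlx-community/Llama-3.2-1B-Instruct-4bit"
+
+# Sharding strategy: "Pipeline" or "Tensor" (optional)
+sharding: "Pipeline"
+
+# Instance type: "MlxRing" or "MlxIbv" (optional)
+instance_meta: "MlxRing"
+
+# If true, run requests sequentially; if false, fire-and-forget (default: false)
+no_overlap: false
+
+# Benchmark stages
+stages: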
+  - name: "stage_name"              # Human-readable name for this stage
+    prompt_length: 100               # Target prompt length in tokens
+    generation_length: 200           # Max tokens to generate
+    time_between_requests: 2.0       # Seconds between firing requests
+    iterations: 10                   # Number of requests in this stage
+```
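+
+Before running a long benchmark, it can help to sanity-check the `stages` list. The sketch below is a hypothetical helper, not part of `bench.py`. It validates a config that has already been parsed (for example with `yaml.safe_load`, as `build_matrix.py` does) and reports how long each stage spends firing requests.
+
+```python
+from dataclasses import dataclass
+from typing import Any
+
+REQUIRED_KEYS = (
+    "name",
+    "prompt_length",
+    "generation_length",
+    "time_between_requests",
+    "iterations",
+)
+
+
+@dataclass(frozen=True)
+class Stage:
+    name: str
+    prompt_length: int
+    generation_length: int
+    time_between_requests: float
+    iterations: int
+
+    def firing_window_s(self) -> float:
+        """Seconds between firing the first and the last request of the stage."""
+        return (self.iterations - 1) * self.time_between_requests
+
+
+def parse_stages(config: dict[str, Any]) -> list[Stage]:
+    """Validate the `stages` section of an already-parsed config."""
+    stages: list[Stage] = []
+    # `stages:` with every entry commented out parses as None.
+    for i, raw in enumerate(config.get("stages") or []):
+        missing = [key for key in REQUIRED_KEYS if key not in raw]
+        if missing:
+            raise ValueError(f"stage {i} is missing {missing}")
+        stage = Stage(
+            name=str(raw["name"]),
+            prompt_length=int(raw["prompt_length"]),
+            generation_length=int(raw["generation_length"]),
+            time_between_requests=float(raw["time_between_requests"]),
+            iterations=int(raw["iterations"]),
+        )
+        if stage.iterations < 1 or stage.time_between_requests < 0:
+            raise ValueError(f"stage {stage.name!r} has invalid timing")
+        stages.append(stage)
+    if not stages:
+        raise ValueError("config defines no stages")
+    return stages
+```
+
+For the example stage above (10 iterations, 2.0 s apart), `firing_window_s()` is 18.0 s. The stage then keeps running until its slowest request completes.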
+
+## Running Benchmarks
+
+### Via GitHub Actions
+
+**On push:**
+- The **`bench`** workflow runs on pushes whose head commit message contains `/bench`
+- It uses `.github/configs/bench_simple.yaml`, set as `CONFIG_FILE` in the workflow's `plan` job
+- All settings (hardware plan, timeout, environment variables, models, stages) are defined in the config file
+
+**Other configs:**
+
+The workflow currently defines only a `push` trigger, so there is no **Run workflow** button or config-file input. To run a different config, such as `.github/configs/bench_config.yaml` for complex multi-stage tests, change `CONFIG_FILE` in the `plan` job of `.github/workflows/bench.yml`.
+
+All other settings (hardware plan, timeout, environment variables, models, stages) are read from the specified config file.
+
+### Via Command Line
+
+```bash
+# Start EXO on localhost:8000 (keep it running in a separate terminal)
+uv run exo --api-port 8000
+
+# Run simple benchmark (1 stage, 1 iteration)
+python3 .github/scripts/bench.py \
+  --api-port 8000 \
+  --config .github/configs/bench_simple.yaml \
+  --expected-nodes 1 \
+  --is-primary true \
+  --timeout-seconds 600
+
+# Run complex staged benchmark (4 stages, multiple iterations)
+python3 .github/scripts/bench.py \
+  --api-port 8000 \
+  --config .github/configs/bench_config.yaml \
+  --expected-nodes 1 \
+  --is-primary true \
+  --timeout-seconds 600
+```
+
+## Output Metrics
+
+For each stage, the benchmark reports:
+
+- **Total Requests**: Number of requests fired
+- **Successful Requests**: Requests that completed successfully
+- **Failed Requests**: Requests that encountered errors
+- **Success Rate**: Percentage of successful requests
+- **Total Tokens**: Sum of all tokens generated across successful requests
+- **Avg Tokens/Request**: Average tokens per successful request
+- **Avg Time/Request**: Average completion time per successful request
+
+A JSON summary is also printed for easy parsing and storage.
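+
+The per-stage numbers above can be derived from a list of per-request results. This is a minimal sketch built on a hypothetical `RequestResult` record. The field and key names are illustrative and may not match the JSON that `bench.py` prints.
+
+```python
+import json
+from dataclasses import dataclass
+
+
+@dataclass
+class RequestResult:
+    success: bool
+    tokens: int = 0         # tokens generated (successful requests only)
+    elapsed_s: float = 0.0  # wall-clock completion time
+
+
+def summarize(results: list[RequestResult]) -> dict[str, float]:
+    """Aggregate one stage's results; averages cover successful requests only."""
+    ok = [r for r in results if r.success]
+    total = len(results)
+    total_tokens = sum(r.tokens for r in ok)
+    return {
+        "total_requests": total,
+        "successful_requests": len(ok),
+        "failed_requests": total - len(ok),
+        "success_rate": 100.0 * len(ok) / total if total else 0.0,
+        "total_tokens": total_tokens,
+        "avg_tokens_per_request": total_tokens / len(ok) if ok else 0.0,
+        "avg_time_per_request": sum(r.elapsed_s for r in ok) / len(ok) if ok else 0.0,
+    }
+
+
+if __name__ == "__main__":
+    stage = [RequestResult(True, 10, 1.0), RequestResult(True, 30, 3.0), RequestResult(False)]
+    print(json.dumps(summarize(stage), indent=2))
+```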
+
+## Creating Custom Benchmarks
+
+To create a custom benchmark:
+
+1. Copy an existing config file (e.g., `bench_config.yaml`)
+2. Modify the stages to match your test scenario
+3. Save it in this directory with a descriptive name
+4. Run it using the workflow or command line
+
+### Example: Sustained Load Test
+
+```yaml
+hardware_plan:
+  M3ULTRA_GPU80_512GB: 2
+
+environment:
+  OVERRIDE_MEMORY_MB: 1024
+
+timeout_seconds: 600
+
+model_ids:
+  - "mlx-community/Llama-3.2-1B-Instruct-4bit"
+
+stages:
+  - name: "sustained_load"
+    prompt_length: 200
+    generation_length: 150
+    time_between_requests: 0.5     # 2 requests/second
+    iterations: 100                 # ~50 seconds of firing
+```
+
+### Example: Varying Prompt Sizes
+
+```yaml
+hardware_plan:
+  M4PRO_GPU16_24GB: 3
+
+timeout_seconds: 900
+
+model_ids:
+  - "mlx-community/Llama-3.2-1B-Instruct-4bit"
+
+stages:
+  - name: "tiny_prompts"
+    prompt_length: 10
+    generation_length: 100
+    time_between_requests: 1.0
+    iterations: 10
+    
+  - name: "medium_prompts"
+    prompt_length: 200
+    generation_length: 100
+    time_between_requests: 1.0
+    iterations: 10
+    
+  - name: "large_prompts"
+    prompt_length: 1000
+    generation_length: 100
+    time_between_requests: 1.0
+    iterations: 10
+```
+
+## Tips
+
+- **Overlapping Requests**: Set `time_between_requests` < expected completion time to test concurrent request handling
+- **Sequential Requests**: Set `time_between_requests` > expected completion time to ensure requests don't overlap
+- **Realistic Load**: Model real usage patterns by varying prompt/generation lengths across stages
+- **Success Rate**: A 100% success rate indicates the cluster handled the load well; lower rates suggest capacity limits
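+
+A rough estimate helps when choosing `time_between_requests` for the overlapping case. Suppose every request takes about `completion_s` seconds and a new one is fired every `time_between_requests` seconds. Then at most `ceil(completion_s / time_between_requests)` requests are in flight at once. The helper below is a hypothetical back-of-the-envelope sketch; real completion times vary with prompt length and load.
+
+```python
+import math
+
+
+def peak_in_flight(completion_s: float, time_between_requests: float, iterations: int) -> int:
+    """Estimate the maximum number of concurrent requests in a fire-and-forget stage.
+
+    Assumes every request takes the same `completion_s` (> 0) seconds.
+    """
+    if iterations <= 0:
+        return 0
+    if time_between_requests <= 0:
+        return iterations  # all requests are fired at once
+    return min(iterations, math.ceil(completion_s / time_between_requests))
+```
+
+Suppose requests in the `stress_test` stage of `bench_config.yaml` took about 4 s each (an illustrative figure, not a measurement). Firing one per second would keep up to 4 requests in flight, whereas a 5 s gap would keep them sequential.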
+
--- a/.github/configs/bench_config.yaml
+++ b/.github/configs/bench_config.yaml
@@ -0,0 +1,49 @@
+# EXO Staged Benchmark Configuration
+# This configuration defines a multi-stage load test for EXO clusters
+
+# Hardware configuration - maps runner labels to instance counts
+hardware_plan:
+  M3ULTRA_GPU80_512GB: 4
+
+# Environment variables to set on each node (optional)
+environment:
+  OVERRIDE_MEMORY_MB: 512
+
+# Timeout for instance and runner readiness (seconds)
+timeout_seconds: 600
+
+# Multiple instances run concurrently on the cluster
+model_ids:
+  - "mlx-community/Qwen3-0.6B-4bit"
+  - "mlx-community/Qwen3-0.6B-4bit"
+
+# Stages run sequentially, each with its own characteristics
+stages:
+  # Stage 1: Light load with short prompts
+  - name: "warmup"
+    prompt_length: 50          # Number of tokens in prompt
+    generation_length: 100     # Max tokens to generate
+    time_between_requests: 5.0 # Seconds between firing requests
+    iterations: 10             # Number of requests to send in this stage
+    
+  # Stage 2: Medium load with medium prompts
+  - name: "medium_load"
+    prompt_length: 200
+    generation_length: 150
+    time_between_requests: 3.0
+    iterations: 20
+    
+  # Stage 3: Heavy load with long prompts - requests will overlap
+  - name: "stress_test"
+    prompt_length: 500
+    generation_length: 200
+    time_between_requests: 1.0  # Fast firing - requests are likely to overlap
+    iterations: 30
+    
+  # Stage 4: Cool down with simple prompts
+  - name: "cooldown"
+    prompt_length: 50
+    generation_length: 50
+    time_between_requests: 10.0
+    iterations: 5
+
--- a/.github/configs/bench_simple.yaml
+++ b/.github/configs/bench_simple.yaml
@@ -0,0 +1,125 @@
+# Simple benchmark
+# Runs one model instance across 2 nodes (puffin4 and puffin8)
+
+# Hardware configuration - maps runner labels to instance counts
+hardware_plan:
+  puffin4: 1
+  puffin8: 1
+
+# Environment variables to set on each node
+environment:
+  PLACEHOLDER: "placeholder"
+  # OVERRIDE_MEMORY_MB: 50000
+  MLX_METAL_FAST_SYNCH: 1
+
+# Timeout for instance and runner readiness (seconds)
+timeout_seconds: 1800
+
+# Model instances to run concurrently
+model_ids:
+  # - "mlx-community/DeepSeek-V3.1-8bit"
+  # - "mlx-community/Kimi-K2-Instruct-4bit"
+  - "mlx-community/Kimi-K2-Thinking"
+  # - "mlx-community/Qwen3-235B-A22B-4bit"
+  # - "mlx-community/Llama-3.3-70B-Instruct-4bit"
+  # - "mlx-community/Llama-3.3-70B-Instruct-8bit"
+  # - "mlx-community/Llama-3.2-1B-Instruct-4bit"
+
+# Sharding strategy: "Pipeline" or "Tensor"
+sharding: "Tensor"
+
+# Instance type: "MlxRing" or "MlxIbv"
+instance_meta: "MlxIbv"
+
+# If true, run requests sequentially (no overlap); if false, fire-and-forget (default: false)
+no_overlap: true
+
+# Benchmark stages, named ppN_gM (N prompt tokens, M generated tokens)
+# pp: 64, 256, 1024, 2048, 4096, 8192, 16384
+# g: 64, 512
+stages:
+  # - name: "simple"
+  #   prompt_length: 512
+  #   generation_length: 10
+  #   time_between_requests: 2.0
+  #   iterations: 5
+  # - name: "pp64_g64"
+  #   prompt_length: 64
+  #   generation_length: 64
+  #   time_between_requests: 2.0
+  #   iterations: 5
+  # - name: "pp64_g512"
+  #   prompt_length: 64
+  #   generation_length: 512
+  #   time_between_requests: 2.0
+  #   iterations: 10
+  # - name: "pp256_g64"
+  #   prompt_length: 256
+  #   generation_length: 64
+  #   time_between_requests: 2.0
+  #   iterations: 5
+  - name: "pp256_g64"
+    prompt_length: 256
+    generation_length: 64
+    time_between_requests: 2.0
+    iterations: 5
+  # - name: "pp256_g512"
+  #   prompt_length: 256
+  #   generation_length: 512
+  #   time_between_requests: 2.0
+  #   iterations: 10
+  # - name: "pp1024_g64"
+  #   prompt_length: 1024
+  #   generation_length: 64
+  #   time_between_requests: 2.0
+  #   iterations: 5
+  # - name: "pp1024_g512"
+  #   prompt_length: 1024
+  #   generation_length: 512
+  #   time_between_requests: 2.0
+  #   iterations: 10
+  # - name: "pp2048_g64"
+  #   prompt_length: 2048
+  #   generation_length: 64
+  #   time_between_requests: 2.0
+  #   iterations: 5
+  # - name: "pp2048_g512"
+  #   prompt_length: 2048
+  #   generation_length: 512
+  #   time_between_requests: 2.0
+  #   iterations: 10
+  # - name: "pp4096_g64"
+  #   prompt_length: 4096
+  #   generation_length: 64
+  #   time_between_requests: 2.0
+  #   iterations: 4
+  # - name: "pp4096_g512"
+  #   prompt_length: 4096
+  #   generation_length: 512
+  #   time_between_requests: 2.0
+  #   iterations: 10
+  # - name: "pp8192_g64"
+  #   prompt_length: 8192
+  #   generation_length: 64
+  #   time_between_requests: 2.0
+  #   iterations: 5
+  # - name: "pp8192_g512"
+  #   prompt_length: 8192
+  #   generation_length: 512
+  #   time_between_requests: 2.0
+  #   iterations: 5
+  # - name: "pp16384_g64"
+  #   prompt_length: 16384
+  #   generation_length: 64
+  #   time_between_requests: 2.0
+  #   iterations: 10
+  # - name: "pp16384_g512"
+  #   prompt_length: 16384
+  #   generation_length: 512
+  #   time_between_requests: 2.0
+  #   iterations: 10
--- a/.github/scripts/bench.py
+++ b/.github/scripts/bench.py
--- a/.github/scripts/build_matrix.py
+++ b/.github/scripts/build_matrix.py
@@ -0,0 +1,70 @@
+#!/usr/bin/env python3
+import json
+import os
+from typing import NotRequired, TypedDict, cast
+
+import yaml
+
+
+class MatrixEntry(TypedDict):
+    label: str
+    index: int
+
+
+class MatrixInclude(TypedDict):
+    label: str
+    index: int
+    is_primary: bool
+    expected_nodes: int
+
+
+class Config(TypedDict):
+    hardware_plan: dict[str, int]
+    timeout_seconds: NotRequired[int]
+    environment: NotRequired[dict[str, str]]
+
+
+# Read the config file
+config_file: str = os.environ["CONFIG_FILE"]
+with open(config_file, "r") as f:
+    # safe_load returns None for an empty file, so fall back to an empty dict
+    config: Config = cast(Config, yaml.safe_load(f) or {})
+
+# Extract hardware plan from config (a missing key is reported like an empty one)
+plan: dict[str, int] = config.get("hardware_plan", {})
+if not plan:
+    raise ValueError(f"No hardware_plan found in {config_file}")
+
+# Build matrix entries
+entries: list[MatrixEntry] = []
+for label, count in plan.items():
+    for idx in range(count):
+        entries.append({"label": label, "index": idx})
+
+total_nodes: int = len(entries)
+matrix: dict[str, list[MatrixInclude]] = {
+    "include": [
+        {
+            "label": e["label"],
+            "index": e["index"],
+            "is_primary": (i == 0),
+            "expected_nodes": total_nodes,
+        }
+        for i, e in enumerate(entries)
+    ]
+}
+
+# Extract other config values
+timeout_seconds: int = config.get("timeout_seconds", 600)
+environment: dict[str, str] = config.get("environment", {})
+
+# Output to GitHub Actions
+with open(os.environ["GITHUB_OUTPUT"], "a") as f:
+    f.write(f"matrix={json.dumps(matrix)}\n")
+    f.write(f"config_file={config_file}\n")
+    f.write(f"timeout_seconds={timeout_seconds}\n")
+    f.write(f"environment={json.dumps(environment)}\n")
+
+print(f"Matrix: {json.dumps(matrix)}")
+print(f"Config file: {config_file}")
+print(f"Timeout: {timeout_seconds}")
+print(f"Environment: {json.dumps(environment)}")
--- a/.github/workflows/BENCH_USAGE.md
+++ b/.github/workflows/BENCH_USAGE.md
@@ -0,0 +1,156 @@
+# Benchmark Workflow Usage
+
+## Overview
+
+The `bench_matrix.yml` workflow enables distributed benchmarking of models across multiple self-hosted macOS runners with different hardware configurations.
+
+## Workflow Inputs
+
+| Input | Description | Default | Required |
+|-------|-------------|---------|----------|
+| `model_id` | Model ID to benchmark | `mlx-community/Llama-3.2-1B-Instruct-4bit` | Yes |
+| `hardware_plan` | JSON mapping of runner labels to counts | `{"M4PRO_GPU16_24GB": 1}` | Yes |
+| `prompt` | Benchmark prompt text | `What is the capital of France?` | No |
+| `timeout_seconds` | Timeout for instance/runner readiness | `600` | No |
+
+## Hardware Plan Format
+
+The `hardware_plan` input is a JSON object mapping runner labels to the number of machines:
+
+```json
+{
+  "M4PRO_GPU16_24GB": 2,
+  "M3ULTRA_GPU80_512GB": 1
+}
+```
+
+This example would:
+- Start 2 runners with the `M4PRO_GPU16_24GB` label
+- Start 1 runner with the `M3ULTRA_GPU80_512GB` label
+- Total of 3 runners coordinating on a single distributed inference instance
+
+## How It Works
+
+1. **Planning Job** (`plan`)
+   - Runs on `ubuntu-latest`
+   - Parses the `hardware_plan` JSON
+   - Generates a dynamic matrix with one entry per runner
+   - Only the first runner (index 0) is marked as `is_primary`
+
+2. **Benchmark Worker Jobs** (`bench_worker`)
+   - Each job runs on a self-hosted macOS runner with the specified label
+   - All runners start EXO in parallel
+   - The primary runner creates the model instance
+   - All runners wait for their assigned runner to be ready (Loaded/Running status)
+   - The primary runner executes the benchmark and prints results
+   - The primary runner deletes the instance
+
+## Example Usage
+
+### Single Machine Benchmark
+
+```yaml
+model_id: mlx-community/Llama-3.2-1B-Instruct-4bit
+hardware_plan: '{"M4PRO_GPU16_24GB": 1}'
+prompt: What is the capital of France?
+timeout_seconds: 600
+```
+
+### Multi-Machine Distributed Benchmark
+
+```yaml
+model_id: mlx-community/Llama-3.2-3B-Instruct-4bit
+hardware_plan: '{"M4PRO_GPU16_24GB": 2, "M3ULTRA_GPU80_512GB": 1}'
+prompt: Explain quantum computing in simple terms.
+timeout_seconds: 900
+```
+
+## Benchmark Output
+
+The primary runner outputs a JSON object with benchmark results:
+
+```json
+{
+  "model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
+  "instance_id": "abc-123-def",
+  "tokens": 42,
+  "elapsed_s": 2.451,
+  "tps": 17.136
+}
+```
+
+Where:
+- `tokens`: Number of streamed chunks generated (each chunk is counted as one token)
+- `elapsed_s`: Total elapsed time in seconds
+- `tps`: Tokens per second (tokens / elapsed_s)
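+
+The token counting and TPS calculation can be sketched as follows. This assumes the stream uses OpenAI-style `data: {...}` chunks terminated by `data: [DONE]`. The function names are hypothetical, and this is not the code in `bench.py`.
+
+```python
+import json
+from typing import Iterable
+
+
+def count_stream_chunks(lines: Iterable[str]) -> int:
+    """Count SSE chunks that carry generated text (each counted as one token)."""
+    chunks = 0
+    for raw in lines:
+        line = raw.strip()
+        if not line.startswith("data:"):
+            continue  # skip blank separators, comments and other SSE fields
+        payload = line[len("data:"):].strip()
+        if payload == "[DONE]":
+            break
+        event = json.loads(payload)
+        for choice in event.get("choices", []):
+            if choice.get("delta", {}).get("content"):
+                chunks += 1
+    return chunks
+
+
+def tokens_per_second(tokens: int, elapsed_s: float) -> float:
+    return tokens / elapsed_s if elapsed_s > 0 else 0.0
+```
+
+With the sample output above, `round(tokens_per_second(42, 2.451), 3)` gives `17.136`.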
+
+## Runner Requirements
+
+Each self-hosted runner must:
+- Be labeled with appropriate hardware tags (e.g., `M4PRO_GPU16_24GB`)
+- Have the `self-hosted` and `macOS` labels
+- Have Nix installed with flakes enabled
+- Have network connectivity to other runners in the same job
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ GitHub Actions Workflow (bench_matrix.yml)                  │
+├─────────────────────────────────────────────────────────────┤
+│                                                              │
+│  ┌────────────────┐                                         │
+│  │  Plan Job      │                                         │
+│  │  (ubuntu)      │──┬─► Matrix: [{label, index, primary}] │
+│  └────────────────┘  │                                      │
+│                      │                                      │
+│  ┌───────────────────▼──────────────────────────────────┐  │
+│  │  Bench Worker Jobs (Matrix)                         │  │
+│  ├──────────────────────────────────────────────────────┤  │
+│  │                                                       │  │
+│  │  Runner 0 (Primary)     Runner 1         Runner 2    │  │
+│  │  ┌─────────────┐       ┌─────────────┐ ┌──────────┐ │  │
+│  │  │ Start EXO   │       │ Start EXO   │ │ Start EXO│ │  │
+│  │  │ Create Inst │       │ Wait...     │ │ Wait...  │ │  │
+│  │  │ Wait Ready  │       │ Wait Ready  │ │ Wait...  │ │  │
+│  │  │ Run Bench   │       │ (idle)      │ │ (idle)   │ │  │
+│  │  │ Print TPS   │       │             │ │          │ │  │
+│  │  │ Delete Inst │       │             │ │          │ │  │
+│  │  └─────────────┘       └─────────────┘ └──────────┘ │  │
+│  └───────────────────────────────────────────────────────┘  │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Implementation Details
+
+### `.github/scripts/bench.py`
+
+A standalone Python script that:
+- Creates instance (primary only)
+- Polls `/state` endpoint until instance and all runners are ready
+- Executes chat completion with timing (primary only)
+- Parses SSE stream and counts tokens
+- Computes TPS metrics
+- Cleans up instance (primary only)
+
+### Key Functions
+
+- `wait_for_instance()`: Polls until instance with model_id appears
+- `wait_for_runners_ready()`: Polls until expected number of runners reach Loaded/Running status
+- `run_benchmark()`: Executes chat completion, measures time, counts tokens
+
+## Troubleshooting
+
+### Instance never becomes ready
+- Check EXO logs in the workflow output
+- Verify model_id is valid and accessible
+- Increase `timeout_seconds`
+
+### Runner mismatch
+- Ensure hardware_plan counts match available labeled runners
+- Check runner labels match exactly (case-sensitive)
+
+### Network issues
+- Verify runners can communicate on the network
+- Check firewall rules between runner hosts
+
--- a/.github/workflows/bench.yml
+++ b/.github/workflows/bench.yml
@@ -0,0 +1,305 @@
+name: bench
+
+on: [push]
+
+jobs:
+  plan:
+    if: contains(github.event.head_commit.message, '/bench')
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.build.outputs.matrix }}
+      config_file: ${{ steps.build.outputs.config_file }}
+      timeout_seconds: ${{ steps.build.outputs.timeout_seconds }}
+      environment: ${{ steps.build.outputs.environment }}
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Build matrix from config file
+        id: build
+        shell: bash
+        run: |
+          set -euo pipefail
+          CONFIG_FILE='.github/configs/bench_simple.yaml'
+          export CONFIG_FILE
+          echo "Config file: $CONFIG_FILE"
+          python3 .github/scripts/build_matrix.py
+
+  bench_worker:
+    needs: plan
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJSON(needs.plan.outputs.matrix) }}
+    name: "bench on ${{ matrix.label }} [${{ matrix.index }}]"
+    runs-on: [self-hosted, macOS, "${{ matrix.label }}"]
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          lfs: false
+
+      - name: Configure git user
+        run: |
+          git config --local user.email "github-actions@users.noreply.github.com"
+          git config --local user.name  "github-actions bot"
+        shell: bash
+
+      # TODO: this is mega hacky and I'd like a simpler solution.
+      - name: Setup Nix Environment
+        run: |
+          echo "Checking for nix installation..."
+          
+          # Check if nix is already available
+          if command -v nix >/dev/null 2>&1; then
+            echo "Nix already in PATH"
+          # Try sourcing profile scripts to set up environment properly
+          elif [ -f /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh ]; then
+            echo "Sourcing multi-user nix-daemon profile script"
+            source /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh
+          elif [ -f "$HOME/.nix-profile/etc/profile.d/nix.sh" ]; then
+            echo "Sourcing single-user nix profile script"
+            source "$HOME/.nix-profile/etc/profile.d/nix.sh"
+          elif [ -f "/nix/var/nix/profiles/per-user/$USER/profile/etc/profile.d/nix.sh" ]; then
+            echo "Sourcing per-user nix profile script"
+            source "/nix/var/nix/profiles/per-user/$USER/profile/etc/profile.d/nix.sh"
+          elif [ -f /etc/profile.d/nix.sh ]; then
+            echo "Sourcing system-wide nix profile script"
+            source /etc/profile.d/nix.sh
+          # Fallback: manually add nix to PATH if binary exists
+          elif [ -f /nix/var/nix/profiles/default/bin/nix ]; then
+            echo "Found nix binary, manually adding to PATH"
+            export PATH="/nix/var/nix/profiles/default/bin:$PATH"
+          elif [ -f "$HOME/.nix-profile/bin/nix" ]; then
+            echo "Found nix binary in user profile, manually adding to PATH"
+            export PATH="$HOME/.nix-profile/bin:$PATH"
+          else
+            echo "Nix not found. Debugging info:"
+            echo "USER: $USER"
+            echo "HOME: $HOME"
+            echo "Current PATH: $PATH"
+            echo ""
+            echo "Checking common Nix locations:"
+            echo "  /nix/var/nix/profiles/default/bin/nix:"
+            ls -la /nix/var/nix/profiles/default/bin/nix 2>/dev/null || echo "    Not found"
+            echo "  /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh:"
+            ls -la /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh 2>/dev/null || echo "    Not found"
+            echo "  ~/.nix-profile/etc/profile.d/nix.sh:"
+            ls -la "$HOME/.nix-profile/etc/profile.d/nix.sh" 2>/dev/null || echo "    Not found"
+            echo "  /nix/var/nix/profiles/per-user/$USER/profile/etc/profile.d/nix.sh:"
+            ls -la "/nix/var/nix/profiles/per-user/$USER/profile/etc/profile.d/nix.sh" 2>/dev/null || echo "    Not found"
+            echo ""
+            echo "/nix directory structure:"
+            ls -la /nix 2>/dev/null || echo "    /nix directory not found"
+            echo ""
+            echo "/nix/var:"
+            ls -la /nix/var 2>/dev/null || echo "    /nix/var not found"
+            echo ""
+            echo "/nix/store:"
+            { ls -la /nix/store 2>/dev/null || echo "    /nix/store not found"; } | head -20
+            echo ""
+            echo "GitHub Actions runner is running as user '$USER'."
+            echo "If Nix is installed for a different user, either:"
+            echo "  1. Install Nix for user '$USER' (multi-user install recommended)"
+            echo "  2. Configure the runner service to run as the user with Nix installed"
+            echo "  3. Ensure Nix is installed system-wide with proper daemon setup"
+            exit 1
+          fi
+          
+          # Verify nix is available and persist to GITHUB_ENV
+          if command -v nix >/dev/null 2>&1; then
+            echo "✓ Nix is available"
+            nix --version
+            echo "PATH=$PATH" >> "$GITHUB_ENV"
+            if [ -n "$NIX_PATH" ]; then
+              echo "NIX_PATH=$NIX_PATH" >> "$GITHUB_ENV"
+            fi
+          else
+            echo "ERROR: Failed to set up Nix"
+            echo "PATH after setup attempt: $PATH"
+            exit 1
+          fi
+        shell: bash
+
+      - name: Setup EXO_HOME and API_PORT
+        run: |
+          EXO_HOME=$(mktemp -d -t exo-e2e-XXXXXXXX)
+          API_PORT=$((49152 + RANDOM % (65535 - 49152 + 1)))
+          EXO_MODELS_DIR="$HOME/.exo/models"
+          EXO_LIBP2P_NAMESPACE="bench-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}"
+          echo "EXO_HOME=$EXO_HOME" >> "$GITHUB_ENV"
+          echo "API_PORT=$API_PORT" >> "$GITHUB_ENV"
+          echo "EXO_MODELS_DIR=$EXO_MODELS_DIR" >> "$GITHUB_ENV"
+          echo "EXO_LIBP2P_NAMESPACE=$EXO_LIBP2P_NAMESPACE" >> "$GITHUB_ENV"
+          echo "Created EXO_HOME: $EXO_HOME"
+          echo "Generated API_PORT: $API_PORT"
+          echo "Using models from: $EXO_MODELS_DIR"
+          echo "Using libp2p namespace: $EXO_LIBP2P_NAMESPACE"
+        shell: bash
+
+      - name: Configure local MLX if available
+        run: |
+          echo "=== DEBUG: Checking for local MLX configuration ==="
+          MODIFIED=false
+          
+          echo "Checking for /Users/Shared/mlx directory..."
+          if [ -d "/Users/Shared/mlx" ]; then
+            echo "✓ Found /Users/Shared/mlx"
+            ls -la /Users/Shared/mlx | head -5
+            echo "Enabling local mlx path in pyproject.toml"
+            sed -i.bak 's|^# mlx = { path = "/Users/Shared/mlx", editable=true }$|mlx = { path = "/Users/Shared/mlx", editable=true }|' pyproject.toml
+            MODIFIED=true
+          else
+            echo "✗ /Users/Shared/mlx not found, will use PyPI version"
+          fi
+          
+          echo "Checking for /Users/Shared/mlx-lm directory..."
+          if [ -d "/Users/Shared/mlx-lm" ]; then
+            echo "✓ Found /Users/Shared/mlx-lm"
+            ls -la /Users/Shared/mlx-lm | head -5
+            echo "Enabling local mlx-lm path in pyproject.toml"
+            sed -i.bak 's|^# mlx-lm = { path = "/Users/Shared/mlx-lm", editable=true }$|mlx-lm = { path = "/Users/Shared/mlx-lm", editable=true }|' pyproject.toml
+            MODIFIED=true
+          else
+            echo "✗ /Users/Shared/mlx-lm not found, will use PyPI version"
+          fi
+          
+          if [ "$MODIFIED" = true ]; then
+            echo "=== Modified pyproject.toml [tool.uv.sources] section: ==="
+            sed -n '/\[tool\.uv\.sources\]/,/^\[/{/^\[tool\.uv\.sources\]/p; /^\[/!p;}' pyproject.toml
+            echo "=== Regenerating uv.lock with local MLX paths... ==="
+            nix --extra-experimental-features nix-command --extra-experimental-features flakes develop --command uv lock --upgrade-package mlx --upgrade-package mlx-lm
+            echo "✓ Lock file regenerated"
+          else
+            echo "⚠ No local MLX directories found, using PyPI packages"
+          fi
+          echo "=== DEBUG: Local MLX configuration complete ==="
+        shell: bash
+
+      - name: Sync dependencies
+        run: |
+          if [ -d "/Users/Shared/test" ]; then
+            pushd /Users/Shared/test
+            uv sync --reinstall
+            popd
+          fi
+          echo "Running just sync to ensure clean dependencies..."
+          nix --extra-experimental-features nix-command --extra-experimental-features flakes develop --command just sync
+        shell: bash
+
+      - name: Start EXO and run bench script
+        shell: bash
+        env:
+          IS_PRIMARY: ${{ matrix.is_primary }}
+          EXPECTED_NODES: ${{ matrix.expected_nodes }}
+          HARDWARE_LABEL: ${{ matrix.label }}
+          CONFIG_FILE: ${{ needs.plan.outputs.config_file }}
+          TIMEOUT_SECONDS: ${{ needs.plan.outputs.timeout_seconds }}
+          ENVIRONMENT_JSON: ${{ needs.plan.outputs.environment }}
+        run: |
+          set -euo pipefail
+
+          # Parse environment variables from config
+          ENV_VARS=""
+          if [ -n "$ENVIRONMENT_JSON" ] && [ "$ENVIRONMENT_JSON" != "{}" ]; then
+            ENV_VARS=$(echo "$ENVIRONMENT_JSON" | python3 -c "import sys, json; env = json.load(sys.stdin); print(' '.join([f'{k}={v}' for k, v in env.items()]))")
+          fi
+
+          echo "Starting EXO with API_PORT=${API_PORT} EXO_HOME=${EXO_HOME} EXO_LIBP2P_NAMESPACE=${EXO_LIBP2P_NAMESPACE}"
+          echo "Environment variables from config: $ENV_VARS"
+          LOG_FILE=/tmp/exo.log
+          : > "$LOG_FILE"
+
+          MASTER_FLAG=""
+          if [ "$IS_PRIMARY" = "true" ]; then
+            MASTER_FLAG="-m"
+          fi
+
+          nix --extra-experimental-features nix-command --extra-experimental-features flakes develop --command bash -c \
+            "EXO_HOME=$EXO_HOME EXO_MODELS_DIR=$EXO_MODELS_DIR EXO_LIBP2P_NAMESPACE=$EXO_LIBP2P_NAMESPACE $ENV_VARS PYTHONUNBUFFERED=1 PYTHONDEBUG=1 PYTHONPATH=. uv run exo $MASTER_FLAG --api-port $API_PORT" \
+            >> "$LOG_FILE" 2>&1 &
+
+          EXO_PID=$!
+          echo "Started EXO in background with PID: $EXO_PID"
+          echo "Log file: $LOG_FILE"
+
+          cleanup() {
+            echo '=== EXO log (tail) ==='
+            tail -n 300 "$LOG_FILE" || true
+            if ps -p "$EXO_PID" >/dev/null 2>&1; then
+              echo "Killing EXO (PID $EXO_PID)"
+              kill "$EXO_PID" || true
+            fi
+          }
+          trap cleanup EXIT
+
+          for i in $(seq 1 60); do
+            if curl -s "http://localhost:${API_PORT}/state" >/dev/null 2>&1; then
+              echo "EXO API ready"
+              break
+            fi
+            if ! ps -p "$EXO_PID" >/dev/null 2>&1; then
+              echo "EXO terminated early"; sed -n '1,200p' "$LOG_FILE" || true; exit 1
+            fi
+            sleep 1
+          done
+
+          RESULTS_FILE="/tmp/bench_results_${GITHUB_RUN_ID}_${GITHUB_RUN_ATTEMPT}_$(date +%s).json"
+          echo "Results will be saved to: $RESULTS_FILE"
+          echo "RESULTS_FILE=$RESULTS_FILE" >> "$GITHUB_ENV"
+
+          echo "Running bench script with config: $CONFIG_FILE, timeout: $TIMEOUT_SECONDS"
+          nix --extra-experimental-features nix-command --extra-experimental-features flakes develop --command bash -c \
+            "PYTHONUNBUFFERED=1 uv run --no-project --with pyyaml --with pydantic python .github/scripts/bench.py \
+              --api-port $API_PORT \
+              --config $CONFIG_FILE \
+              --expected-nodes ${EXPECTED_NODES} \
+              --is-primary ${IS_PRIMARY} \
+              --timeout-seconds ${TIMEOUT_SECONDS} \
+              --output $RESULTS_FILE \
+              --git-commit ${GITHUB_SHA} \
+              --hardware-labels ${HARDWARE_LABEL}"
+
+      - name: Install AWS CLI
+        if: always() && env.RESULTS_FILE && matrix.is_primary
+        run: |
+          if ! command -v aws &> /dev/null; then
+            echo "AWS CLI not found, installing..."
+            brew install awscli
+          else
+            echo "AWS CLI already installed"
+          fi
+        shell: bash
+
+      - name: Upload results to S3
+        if: always() && env.RESULTS_FILE && matrix.is_primary
+        env:
+          AWS_ACCESS_KEY_ID: ${{ secrets.S3_BENCHMARKS_AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_BENCHMARKS_AWS_SECRET_ACCESS_KEY }}
+          AWS_DEFAULT_REGION: us-east-1
+        run: |
+          echo "Checking for results file: $RESULTS_FILE"
+          echo "Is primary: ${{ matrix.is_primary }}"
+
+          if [ -f "$RESULTS_FILE" ]; then
+            TIMESTAMP=$(date -u +%Y/%m/%d/%H%M%S)
+            S3_KEY="bench/${TIMESTAMP}_${GITHUB_SHA:0:8}_${GITHUB_RUN_ID}.json"
+            echo "Uploading results to s3://exo-benchmark-results/$S3_KEY"
+
+            aws s3 cp "$RESULTS_FILE" "s3://exo-benchmark-results/$S3_KEY" \
+              --content-type application/json \
+              --metadata "commit=${GITHUB_SHA},run_id=${GITHUB_RUN_ID},branch=${GITHUB_REF_NAME}"
+
+            echo "Results uploaded successfully"
+            echo "View at: https://exo-benchmark-results.s3.amazonaws.com/$S3_KEY"
+          else
+            echo "Results file not found at: $RESULTS_FILE"
+            echo "Skipping upload"
+          fi
+        shell: bash
+
+      - name: Cleanup EXO_HOME
+        run: |
+          echo "Cleaning up EXO_HOME: $EXO_HOME"
+          rm -rf "$EXO_HOME"
+        shell: bash
+        if: always()
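The `Start EXO and run bench script` step flattens `ENVIRONMENT_JSON` into `KEY=value` pairs with an inline Python one-liner and splices them unquoted into a `bash -c` string, so a value containing spaces or shell metacharacters would be word-split by the inner shell. A minimal sketch of a quoting-safe variant, assuming the config's `environment` object maps names to scalar values; `format_env_assignments` and the example keys are hypothetical, not part of the workflow:

```python
import json
import shlex


def format_env_assignments(environment_json: str) -> str:
    """Turn a JSON object into space-separated, shell-quoted KEY=value pairs.

    Hypothetical helper: same output shape as the workflow's one-liner, but
    each value goes through shlex.quote so it survives `bash -c` intact.
    """
    if not environment_json.strip():
        return ""
    environment = json.loads(environment_json)
    if not isinstance(environment, dict):
        raise ValueError("expected a JSON object of environment variables")
    return " ".join(
        f"{key}={shlex.quote(str(value))}" for key, value in environment.items()
    )


if __name__ == "__main__":
    # EXAMPLE_FLAG / EXAMPLE_ARGS are illustrative names only.
    print(format_env_assignments('{"EXAMPLE_FLAG": "1", "EXAMPLE_ARGS": "a b"}'))
```

Because the outer shell does not re-scan the result of `$ENV_VARS` expansion, the single quotes added by `shlex.quote` reach the inner `bash -c` unchanged and are honoured there.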
--- a/.github/workflows/build-app.yml
+++ b/.github/workflows/build-app.yml
@@ -1,18 +1,6 @@
 name: Build EXO macOS DMG

-# Release workflow:
-# 1. Create a draft GitHub Release with the tag name (e.g. v1.0.0) and write release notes in markdown
-# 2. Push the tag: git tag v1.0.0 && git push origin v1.0.0
-# 3. This workflow builds, signs, and notarizes the DMG
-# 4. Release notes are embedded in appcast.xml for Sparkle (rendered as markdown)
-# 5. DMG and appcast.xml are uploaded to S3
-# 6. The draft GitHub Release is published with the DMG attached
-#
-# For alpha releases (e.g. v1.0.0-alpha.1): draft release and notes are optional.
-# If no draft exists, a release is auto-created with generated notes.
-
 on:
-  workflow_dispatch:
  push:
    tags:
      - "v*"
@@ -22,17 +10,14 @@ on:
 jobs:
  build-macos-app:
    runs-on: "macos-26"
-    permissions:
-      contents: write
    env:
-      SPARKLE_VERSION: 2.9.0-beta.1
+      SPARKLE_VERSION: 2.8.1
      SPARKLE_DOWNLOAD_PREFIX: ${{ secrets.SPARKLE_DOWNLOAD_PREFIX }}
      SPARKLE_FEED_URL: ${{ secrets.SPARKLE_FEED_URL }}
      SPARKLE_ED25519_PUBLIC: ${{ secrets.SPARKLE_ED25519_PUBLIC }}
      SPARKLE_ED25519_PRIVATE: ${{ secrets.SPARKLE_ED25519_PRIVATE }}
      SPARKLE_S3_BUCKET: ${{ secrets.SPARKLE_S3_BUCKET }}
      SPARKLE_S3_PREFIX: ${{ secrets.SPARKLE_S3_PREFIX }}
-      EXO_BUG_REPORT_PRESIGNED_URL_ENDPOINT: ${{ secrets.EXO_BUG_REPORT_PRESIGNED_URL_ENDPOINT }}
      AWS_REGION: ${{ secrets.AWS_REGION }}
      EXO_BUILD_NUMBER: ${{ github.run_number }}
      EXO_LIBP2P_NAMESPACE: ${{ github.ref_name }}
@@ -49,7 +34,7 @@ jobs:

      - name: Derive release version from tag
        run: |
-          if [[ "$GITHUB_REF_NAME" == "test-app" || "${{ github.event_name }}" == "workflow_dispatch" ]]; then
+          if [[ "$GITHUB_REF_NAME" == "test-app" ]]; then
            VERSION="0.0.0-alpha.0"
            echo "IS_ALPHA=true" >> $GITHUB_ENV
          else
@@ -62,32 +47,6 @@ jobs:
          fi
          echo "RELEASE_VERSION=$VERSION" >> $GITHUB_ENV

-      - name: Compute build version from semver
-        run: |
-          VERSION="$RELEASE_VERSION"
-          # Extract major.minor.patch (strip prerelease suffix)
-          BASE_VERSION="${VERSION%%-*}"
-          MAJOR=$(echo "$BASE_VERSION" | cut -d. -f1)
-          MINOR=$(echo "$BASE_VERSION" | cut -d. -f2)
-          PATCH=$(echo "$BASE_VERSION" | cut -d. -f3)
-
-          # Extract prerelease number (e.g., "alpha.2" -> 2, or 999 for releases)
-          if [[ "$VERSION" == *-* ]]; then
-            PRERELEASE_PART="${VERSION#*-}"
-            PRERELEASE_NUM="${PRERELEASE_PART##*.}"
-            # Default to 0 if not a number
-            if ! [[ "$PRERELEASE_NUM" =~ ^[0-9]+$ ]]; then
-              PRERELEASE_NUM=0
-            fi
-          else
-            PRERELEASE_NUM=999
-          fi
-
-          # Compute: PRERELEASE + (1000 * PATCH) + (1_000_000 * MINOR) + (1_000_000_000 * MAJOR)
-          BUILD_VERSION=$((PRERELEASE_NUM + 1000 * PATCH + 1000000 * MINOR + 1000000000 * MAJOR))
-          echo "EXO_BUILD_VERSION=$BUILD_VERSION" >> $GITHUB_ENV
-          echo "Computed build version: $BUILD_VERSION from $VERSION"
-
      - name: Ensure tag commit is on main
        if: github.ref_type == 'tag'
        run: |
@@ -100,52 +59,6 @@ jobs:
            exit 1
          fi

-      - name: Fetch and validate release notes
-        if: github.ref_type == 'tag'
-        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-        run: |
-          # Find draft release by name using gh release list (more reliable with default token)
-          echo "Looking for draft release named '$GITHUB_REF_NAME'..."
-          DRAFT_EXISTS=$(gh release list --json name,isDraft --jq ".[] | select(.isDraft == true) | select(.name == \"$GITHUB_REF_NAME\") | .name" 2>/dev/null || echo "")
-
-          if [[ -z "$DRAFT_EXISTS" ]]; then
-            if [[ "$IS_ALPHA" == "true" ]]; then
-              echo "No draft release found for alpha tag $GITHUB_REF_NAME (optional for alphas)"
-              echo "HAS_RELEASE_NOTES=false" >> $GITHUB_ENV
-              exit 0
-            fi
-            echo "ERROR: No draft release found for tag $GITHUB_REF_NAME"
-            echo "Please create a draft release with release notes before pushing the tag."
-            exit 1
-          fi
-
-          # Fetch full release details via API to get body and ID
-          echo "Found draft release, fetching details..."
-          RELEASE_JSON=$(gh api repos/${{ github.repository }}/releases --jq ".[] | select(.draft == true) | select(.name == \"$GITHUB_REF_NAME\")" 2>/dev/null || echo "")
-
-          # Extract release notes
-          NOTES=$(echo "$RELEASE_JSON" | jq -r '.body // ""')
-          if [[ -z "$NOTES" || "$NOTES" == "null" ]]; then
-            if [[ "$IS_ALPHA" == "true" ]]; then
-              echo "Draft release has no notes (optional for alphas)"
-              echo "HAS_RELEASE_NOTES=false" >> $GITHUB_ENV
-              exit 0
-            fi
-            echo "ERROR: Draft release exists but has no release notes"
-            echo "Please add release notes to the draft release before pushing the tag."
-            exit 1
-          fi
-
-          # Save release ID for later publishing
-          RELEASE_ID=$(echo "$RELEASE_JSON" | jq -r '.id')
-          echo "DRAFT_RELEASE_ID=$RELEASE_ID" >> $GITHUB_ENV
-          echo "HAS_RELEASE_NOTES=true" >> $GITHUB_ENV
-
-          echo "Found draft release (ID: $RELEASE_ID), saving release notes..."
-          echo "$NOTES" > /tmp/release_notes.md
-          echo "RELEASE_NOTES_FILE=/tmp/release_notes.md" >> $GITHUB_ENV
-
      # ============================================================
      # Install dependencies
      # ============================================================
@@ -172,22 +85,11 @@ jobs:
          uv python install
          uv sync --locked

-      - name: Install Nix
-        uses: cachix/install-nix-action@v31
-        with:
-          nix_path: nixpkgs=channel:nixos-unstable
-
-      - name: Configure Cachix
-        uses: cachix/cachix-action@v14
-        with:
-          name: exo
-          authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
-
      - name: Build dashboard
        run: |
-          DASHBOARD_OUT=$(nix build .#dashboard --print-build-logs --no-link --print-out-paths)
-          mkdir -p dashboard/build
-          cp -r "$DASHBOARD_OUT"/* dashboard/build/
+          cd dashboard
+          npm ci
+          npm run build

      - name: Install Sparkle CLI
        run: |
@@ -260,12 +162,11 @@ jobs:
            -configuration Release \
            -derivedDataPath build \
            MARKETING_VERSION="$RELEASE_VERSION" \
-            CURRENT_PROJECT_VERSION="$EXO_BUILD_VERSION" \
+            CURRENT_PROJECT_VERSION="$EXO_BUILD_NUMBER" \
            EXO_BUILD_TAG="$RELEASE_VERSION" \
            EXO_BUILD_COMMIT="$GITHUB_SHA" \
            SPARKLE_FEED_URL="$SPARKLE_FEED_URL" \
            SPARKLE_ED25519_PUBLIC="$SPARKLE_ED25519_PUBLIC" \
-            EXO_BUG_REPORT_PRESIGNED_URL_ENDPOINT="$EXO_BUG_REPORT_PRESIGNED_URL_ENDPOINT" \
            CODE_SIGNING_IDENTITY="$SIGNING_IDENTITY" \
            CODE_SIGN_INJECT_BASE_ENTITLEMENTS=YES
          mkdir -p ../../output
@@ -363,28 +264,6 @@ jobs:
            $CHANNEL_FLAG \
            .

-      - name: Inject release notes into appcast
-        if: github.ref_type == 'tag' && env.HAS_RELEASE_NOTES == 'true'
-        env:
-          RELEASE_VERSION: ${{ env.RELEASE_VERSION }}
-        run: |
-          # Inject markdown release notes with sparkle:format="markdown" (Sparkle 2.9+)
-          export NOTES=$(cat "$RELEASE_NOTES_FILE")
-
-          # Insert description after the enclosure tag for this version
-          awk '
-            /<enclosure[^>]*>/ && index($0, ENVIRON["RELEASE_VERSION"]) {
-              print
-              print "            <description sparkle:format=\"markdown\"><![CDATA["
-              print ENVIRON["NOTES"]
-              print "            ]]></description>"
-              next
-            }
-            { print }
-          ' output/appcast.xml > output/appcast.xml.tmp && mv output/appcast.xml.tmp output/appcast.xml
-
-          echo "Injected markdown release notes for version $RELEASE_VERSION"
-
      # ============================================================
      # Upload artifacts
      # ============================================================
@@ -415,28 +294,5 @@ jobs:
          aws s3 cp "$DMG_NAME" "s3://${SPARKLE_S3_BUCKET}/${PREFIX}${DMG_NAME}"
          if [[ "$IS_ALPHA" != "true" ]]; then
            aws s3 cp "$DMG_NAME" "s3://${SPARKLE_S3_BUCKET}/${PREFIX}EXO-latest.dmg"
-            aws s3 cp appcast.xml "s3://${SPARKLE_S3_BUCKET}/${PREFIX}appcast.xml" --content-type application/xml --cache-control no-cache
-          fi
-
-      - name: Publish GitHub Release
-        if: github.ref_type == 'tag'
-        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-        run: |
-          DMG_PATH="output/EXO-${RELEASE_VERSION}.dmg"
-
-          if [[ "$HAS_RELEASE_NOTES" == "true" ]]; then
-            # Update the draft release with the tag and upload DMG
-            gh api --method PATCH "repos/${{ github.repository }}/releases/$DRAFT_RELEASE_ID" \
-              -f tag_name="$GITHUB_REF_NAME" \
-              -F draft=false
-            gh release upload "$GITHUB_REF_NAME" "$DMG_PATH" --clobber
-            echo "Published release $GITHUB_REF_NAME with DMG attached"
-          else
-            # Alpha without draft release - create one with auto-generated notes
-            gh release create "$GITHUB_REF_NAME" "$DMG_PATH" \
-              --title "$GITHUB_REF_NAME" \
-              --generate-notes \
-              --prerelease
-            echo "Created alpha release $GITHUB_REF_NAME with auto-generated notes"
          fi
+          aws s3 cp appcast.xml "s3://${SPARKLE_S3_BUCKET}/${PREFIX}appcast.xml" --content-type application/xml --cache-control no-cache
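The removed `Compute build version from semver` step packed a semantic version into one integer, `PRERELEASE + 1000*PATCH + 1_000_000*MINOR + 1_000_000_000*MAJOR`, using 999 for a final release so it sorts after its prereleases; presumably this kept the bundle build number monotonic for Sparkle. After this change `CURRENT_PROJECT_VERSION` falls back to `$EXO_BUILD_NUMBER` (the run number). A Python transcription of the removed shell arithmetic, for reference only; the function name is hypothetical and well-formed `MAJOR.MINOR.PATCH` input is assumed:

```python
import re


def build_version_from_semver(version: str) -> int:
    """Pack MAJOR.MINOR.PATCH[-PRERELEASE] into a single integer.

    Mirrors the removed shell step: the prerelease number is the last
    dot-separated field after the first '-' (0 if not purely numeric),
    and a version without a '-' suffix uses 999.
    """
    base, has_prerelease, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in base.split(".")[:3])
    if has_prerelease:
        last_field = prerelease.rsplit(".", 1)[-1]
        prerelease_number = int(last_field) if re.fullmatch(r"[0-9]+", last_field) else 0
    else:
        prerelease_number = 999
    return (
        prerelease_number
        + 1_000 * patch
        + 1_000_000 * minor
        + 1_000_000_000 * major
    )
```

The scheme only stays collision-free while prerelease numbers and the patch and minor components stay below 1000.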
--- a/.github/workflows/pipeline.yml
+++ b/.github/workflows/pipeline.yml
@@ -20,12 +20,6 @@ jobs:
        with:
          nix_path: nixpkgs=channel:nixos-unstable

-      - uses: cachix/cachix-action@v14
-        name: Configure Cachix
-        with:
-          name: exo
-          authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
-
      - name: Configure git user
        run: |
          git config --local user.email "github-actions@users.noreply.github.com"
@@ -94,19 +88,9 @@ jobs:

      - uses: ./.github/actions/typecheck

-  nix:
-    name: Build and check (${{ matrix.system }})
-    runs-on: ${{ matrix.runner }}
-    strategy:
-      fail-fast: false
-      matrix:
-        include:
-          - runner: macos-26
-            system: aarch64-darwin
-          - runner: ubuntu-latest
-            system: x86_64-linux
-          - runner: ubuntu-24.04-arm
-            system: aarch64-linux
+  nix-flake-check:
+    name: Check Nix flake
+    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
@@ -117,20 +101,83 @@ jobs:
        with:
          nix_path: nixpkgs=channel:nixos-unstable

-      - uses: cachix/cachix-action@v14
-        name: Configure Cachix
-        with:
-          name: exo
-          authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
-
-      - name: Build all Nix outputs
-        run: |
-          nix flake show --json | jq -r '
-            [
-              (.packages."${{ matrix.system }}" // {} | keys[] | ".#packages.${{ matrix.system }}.\(.)"),
-              (.devShells."${{ matrix.system }}" // {} | keys[] | ".#devShells.${{ matrix.system }}.\(.)")
-            ] | .[]
-          ' | xargs nix build
-
      - name: Run nix flake check
-        run: nix flake check
+        run: |
+          nix flake check
+        shell: bash
+
+#  ci:
+#    needs: typecheck
+#    runs-on: ubuntu-latest
+#    permissions:
+#      contents: read
+#    env:
+#      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+#    steps:
+#      - name: Checkout repository
+#        uses: actions/checkout@v4
+#        with:
+#          fetch-depth: 0
+#          token: ${{ secrets.GITHUB_TOKEN }}
+#          lfs: true
+#
+#      - name: Configure git user
+#        run: |
+#          git config --local user.email "github-actions@users.noreply.github.com"
+#          git config --local user.name  "github-actions bot"
+#        shell: bash
+#
+#      - name: Pull LFS files
+#        run: |
+#          echo "Pulling Git LFS files..."
+#          git lfs pull
+#        shell: bash
+#
+#      - name: Setup EXO_HOME and API_PORT
+#        run: |
+#          EXO_HOME=$(mktemp -d -t exo-ci-XXXXXXXX)
+#          # Generate random port (macOS compatible method)
+#          API_PORT=$((49152 + RANDOM % (65535 - 49152 + 1)))
+#          echo "EXO_HOME=$EXO_HOME" >> $GITHUB_ENV
+#          echo "API_PORT=$API_PORT" >> $GITHUB_ENV
+#          echo "Created EXO_HOME: $EXO_HOME"
+#          echo "Generated API_PORT: $API_PORT"
+#        shell: bash
+#
+#      - name: Setup Nix Environment
+#        run: |
+#          echo "Checking for nix installation..."
+#          
+#          # Check if nix binary exists directly
+#          if [ -f /nix/var/nix/profiles/default/bin/nix ]; then
+#            echo "Found nix binary at /nix/var/nix/profiles/default/bin/nix"
+#            export PATH="/nix/var/nix/profiles/default/bin:$PATH"
+#            echo "PATH=$PATH" >> $GITHUB_ENV
+#            nix --version
+#          elif [ -f /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh ]; then
+#            echo "Found nix profile script, sourcing..."
+#            source /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh
+#            nix --version
+#          elif command -v nix >/dev/null 2>&1; then
+#            echo "Nix already in PATH"
+#            nix --version
+#          else
+#            echo "Nix not found. Debugging info:"
+#            echo "Contents of /nix/var/nix/profiles/default/:"
+#            ls -la /nix/var/nix/profiles/default/ 2>/dev/null || echo "Directory not found"
+#            echo "Contents of /nix/var/nix/profiles/default/bin/:"
+#            ls -la /nix/var/nix/profiles/default/bin/ 2>/dev/null || echo "Directory not found"
+#            exit 1
+#          fi
+#        shell: bash
+#
+#      - uses: ./.github/actions/lint-check
+#
+#      - uses: ./.github/actions/unit-test
+#
+#      - name: Cleanup EXO_HOME
+#        run: |
+#          echo "Cleaning up EXO_HOME: $EXO_HOME"
+#          rm -rf "$EXO_HOME"
+#        shell: bash
+#        if: always()
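The removed `Build all Nix outputs` step used `jq` to turn `nix flake show --json` into installables of the form `.#packages.<system>.<name>` and `.#devShells.<system>.<name>`, then passed them to `xargs nix build`. A Python equivalent of that filter, assuming the JSON nests outputs as `{category: {system: {name: ...}}}`; the sample data in the test is illustrative, not this flake's real output:

```python
import json


def flake_installables(flake_show_json: str, system: str) -> list[str]:
    """Return installables for every package and devShell of one system.

    Missing categories or systems contribute nothing, like jq's `// {}`
    fallback; names are sorted to match jq's `keys[]` ordering.
    """
    outputs = json.loads(flake_show_json)
    installables: list[str] = []
    for category in ("packages", "devShells"):
        per_system = outputs.get(category, {}).get(system, {})
        installables.extend(
            f".#{category}.{system}.{name}" for name in sorted(per_system)
        )
    return installables
```

Replacing this matrix with a single `nix flake check` on `ubuntu-latest` means packages and dev shells for the other two systems are no longer built in CI.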
--- a/.gitignore
+++ b/.gitignore
@@ -16,7 +16,6 @@ digest.txt
 *.xcuserdatad/
 **/.DS_Store
 app/EXO/build/
-dist/


 # rust
--- a/.mlx_typings/mlx_lm/models/deepseek_v3.pyi
+++ b/.mlx_typings/mlx_lm/models/deepseek_v3.pyi
@@ -1,156 +0,0 @@
-"""Type stubs for mlx_lm.models.deepseek_v3"""
-
-from dataclasses import dataclass
-from typing import Any, Dict, Optional
-
-import mlx.core as mx
-import mlx.nn as nn
-
-from .base import BaseModelArgs
-from .switch_layers import SwitchGLU
-
-@dataclass
-class ModelArgs(BaseModelArgs):
-    model_type: str
-    vocab_size: int
-    hidden_size: int
-    intermediate_size: int
-    moe_intermediate_size: int
-    num_hidden_layers: int
-    num_attention_heads: int
-    num_key_value_heads: int
-    n_shared_experts: Optional[int]
-    n_routed_experts: Optional[int]
-    routed_scaling_factor: float
-    kv_lora_rank: int
-    q_lora_rank: Optional[int]
-    qk_rope_head_dim: int
-    v_head_dim: int
-    qk_nope_head_dim: int
-    topk_method: str
-    scoring_func: str
-    norm_topk_prob: bool
-    n_group: int
-    topk_group: int
-    num_experts_per_tok: int
-    moe_layer_freq: int
-    first_k_dense_replace: int
-    max_position_embeddings: int
-    rms_norm_eps: float
-    rope_theta: float
-    rope_scaling: Optional[Dict[str, Any]]
-    attention_bias: bool
-
-class DeepseekV3Attention(nn.Module):
-    config: ModelArgs
-    hidden_size: int
-    num_heads: int
-    max_position_embeddings: int
-    rope_theta: float
-    q_lora_rank: Optional[int]
-    qk_rope_head_dim: int
-    kv_lora_rank: int
-    v_head_dim: int
-    qk_nope_head_dim: int
-    q_head_dim: int
-    scale: float
-    q_proj: nn.Linear
-    q_a_proj: nn.Linear
-    q_a_layernorm: nn.RMSNorm
-    q_b_proj: nn.Linear
-    kv_a_proj_with_mqa: nn.Linear
-    kv_a_layernorm: nn.RMSNorm
-    kv_b_proj: nn.Linear
-    o_proj: nn.Linear
-    rope: Any
-
-    def __init__(self, config: ModelArgs) -> None: ...
-    def __call__(
-        self,
-        x: mx.array,
-        mask: Optional[mx.array] = None,
-        cache: Optional[Any] = None,
-    ) -> mx.array: ...
-
-class DeepseekV3MLP(nn.Module):
-    config: ModelArgs
-    hidden_size: int
-    intermediate_size: int
-    gate_proj: nn.Linear
-    up_proj: nn.Linear
-    down_proj: nn.Linear
-
-    def __init__(
-        self,
-        config: ModelArgs,
-        hidden_size: Optional[int] = None,
-        intermediate_size: Optional[int] = None,
-    ) -> None: ...
-    def __call__(self, x: mx.array) -> mx.array: ...
-
-class MoEGate(nn.Module):
-    config: ModelArgs
-    top_k: int
-    norm_topk_prob: bool
-    n_routed_experts: Optional[int]
-    routed_scaling_factor: float
-    n_group: int
-    topk_group: int
-    weight: mx.array
-    e_score_correction_bias: mx.array
-
-    def __init__(self, config: ModelArgs) -> None: ...
-    def __call__(self, x: mx.array) -> tuple[mx.array, mx.array]: ...
-
-class DeepseekV3MoE(nn.Module):
-    config: ModelArgs
-    num_experts_per_tok: int
-    switch_mlp: SwitchGLU
-    gate: MoEGate
-    shared_experts: DeepseekV3MLP
-    sharding_group: Optional[mx.distributed.Group]
-
-    def __init__(self, config: ModelArgs) -> None: ...
-    def __call__(self, x: mx.array) -> mx.array: ...
-
-class DeepseekV3DecoderLayer(nn.Module):
-    self_attn: DeepseekV3Attention
-    mlp: DeepseekV3MLP | DeepseekV3MoE
-    input_layernorm: nn.RMSNorm
-    post_attention_layernorm: nn.RMSNorm
-
-    def __init__(self, config: ModelArgs, layer_idx: int) -> None: ...
-    def __call__(
-        self,
-        x: mx.array,
-        mask: Optional[mx.array] = None,
-        cache: Optional[Any] = None,
-    ) -> mx.array: ...
-
-class DeepseekV3Model(nn.Module):
-    vocab_size: int
-    embed_tokens: nn.Embedding
-    layers: list[DeepseekV3DecoderLayer]
-    norm: nn.RMSNorm
-
-    def __init__(self, config: ModelArgs) -> None: ...
-    def __call__(
-        self,
-        x: mx.array,
-        cache: Optional[Any] = None,
-    ) -> mx.array: ...
-
-class Model(nn.Module):
-    model_type: str
-    model: DeepseekV3Model
-    lm_head: nn.Linear
-
-    def __init__(self, config: ModelArgs) -> None: ...
-    def __call__(
-        self,
-        inputs: mx.array,
-        cache: Optional[Any] = None,
-    ) -> mx.array: ...
-    def sanitize(self, weights: dict[str, Any]) -> dict[str, Any]: ...
-    @property
-    def layers(self) -> list[DeepseekV3DecoderLayer]: ...
--- a/.mlx_typings/mlx_lm/models/switch_layers.pyi
+++ b/.mlx_typings/mlx_lm/models/switch_layers.pyi
@@ -57,11 +57,6 @@ class SwiGLU(nn.Module):
    def __call__(self, x, gate): ...

 class SwitchGLU(nn.Module):
-    gate_proj: SwitchLinear
-    up_proj: SwitchLinear
-    down_proj: SwitchLinear
-    activation: SwiGLU
-
    def __init__(
        self,
        input_dims: int,
--- a/.mlx_typings/mlx_lm/tokenizer_utils.pyi
+++ b/.mlx_typings/mlx_lm/tokenizer_utils.pyi
@@ -4,7 +4,6 @@ This type stub file was generated by pyright.

 from functools import partial
 from pathlib import Path
-from typing import Any

 from transformers import PreTrainedTokenizerFast

@@ -104,55 +103,37 @@ class TokenizerWrapper:
    Accessing any attribute other than the ``detokenizer`` is forwarded to the
    huggingface tokenizer.
    """
+    def __init__(self, tokenizer, detokenizer_class=..., eos_token_ids=...) -> None: ...
+    def add_eos_token(self, token: str):  # -> None:
+        ...
+    @property
+    def has_thinking(self):  # -> bool:
+        ...
+    @property
+    def think_start(self):  # -> str | None:
+        ...
+    @property
+    def think_end(self):  # -> str | None:
+        ...
+    @property
+    def has_tool_calling(self):  # -> bool:
+        ...
+    @property
+    def tool_call_start(self):  # -> str | None:
+        ...
+    @property
+    def tool_call_end(self):  # -> str | None:
+        ...
+    @property
+    def detokenizer(self):  # -> NaiveStreamingDetokenizer:
+        """
+        Get a stateful streaming detokenizer.
+        """

-    _tokenizer: PreTrainedTokenizerFast
-    eos_token_id: int | None
-    eos_token: str | None
-    bos_token_id: int | None
-    bos_token: str | None
-    vocab_size: int
-    all_special_tokens: list[str]
-
-    def __init__(
-        self,
-        tokenizer: Any,
-        detokenizer_class: Any = ...,
-        eos_token_ids: list[int] | None = ...,
-        chat_template: Any = ...,
-        tool_parser: Any = ...,
-        tool_call_start: str | None = ...,
-        tool_call_end: str | None = ...,
-    ) -> None: ...
-    def encode(self, text: str, **kwargs: Any) -> list[int]: ...
-    def decode(self, token_ids: list[int], **kwargs: Any) -> str: ...
-    def apply_chat_template(
-        self,
-        messages: list[dict[str, Any]],
-        tokenize: bool = False,
-        add_generation_prompt: bool = False,
-        tools: Any = None,
-        **kwargs: Any,
-    ) -> str: ...
-    def get_vocab(self) -> dict[str, int]: ...
-    def add_eos_token(self, token: str) -> None: ...
-    @property
-    def has_thinking(self) -> bool: ...
-    @property
-    def think_start(self) -> str | None: ...
-    @property
-    def think_end(self) -> str | None: ...
-    @property
-    def has_tool_calling(self) -> bool: ...
-    @property
-    def tool_call_start(self) -> str | None: ...
-    @property
-    def tool_call_end(self) -> str | None: ...
-    @property
-    def detokenizer(self) -> NaiveStreamingDetokenizer:
-        """Get a stateful streaming detokenizer."""
-
-    def __getattr__(self, attr: str) -> Any: ...
-    def __setattr__(self, attr: str, value: Any) -> None: ...
+    def __getattr__(self, attr):  # -> set[Any] | Any:
+        ...
+    def __setattr__(self, attr, value):  # -> None:
+        ...

 class NewlineTokenizer(PreTrainedTokenizerFast):
    """A tokenizer that replaces newlines with <n> and <n> with new line."""
@@ -165,11 +146,18 @@ class NewlineTokenizer(PreTrainedTokenizerFast):
    def batch_decode(self, *args, **kwargs):  # -> list[str]:
        ...

-def load(
+def load_tokenizer(
    model_path: Path,
-    tokenizer_config_extra: dict[str, Any] | None = None,
-    eos_token_ids: list[int] | int | None = None,
-) -> TokenizerWrapper:
+    tokenizer_config_extra=...,
+    return_tokenizer=...,
+    eos_token_ids=...,
+) -> (
+    TokenizerWrapper
+    | type[SPMStreamingDetokenizer]
+    | partial[SPMStreamingDetokenizer]
+    | type[BPEStreamingDetokenizer]
+    | type[NaiveStreamingDetokenizer]
+):
    """Load a huggingface tokenizer and try to infer the type of streaming
    detokenizer to use.

@@ -177,7 +165,4 @@ def load(
    a Hugging Face repo ID.
    """

-# Alias for backward compatibility
-load_tokenizer = load
-
-def no_bos_or_eos(sequence: list[int], bos: int, eos: int) -> list[int]: ...
+def no_bos_or_eos(sequence: list, bos: int, eos: int) -> list: ...
--- a/.prettierrc
+++ b/.prettierrc
@@ -1,3 +0,0 @@
-{
-  "useTabs": true
-}
--- a/.swift-format
+++ b/.swift-format
@@ -1,6 +0,0 @@
-{
-  "version": 1,
-  "indentation": {
-    "spaces": 4
-  }
-}
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,121 +0,0 @@
-# AGENTS.md
-
-This file provides guidance to AI coding agents when working with code in this repository.
-
-## Project Overview
-
-exo is a distributed AI inference system that connects multiple devices into a cluster. It enables running large language models across multiple machines using MLX as the inference backend and libp2p for peer-to-peer networking.
-
-## Build & Run Commands
-
-```bash
-# Build the dashboard (required before running exo)
-cd dashboard && npm install && npm run build && cd ..
-
-# Run exo (starts both master and worker with API at http://localhost:52415)
-uv run exo
-
-# Run with verbose logging
-uv run exo -v   # or -vv for more verbose
-
-# Run tests (excludes slow tests by default)
-uv run pytest
-
-# Run all tests including slow tests
-uv run pytest -m ""
-
-# Run a specific test file
-uv run pytest src/exo/shared/tests/test_election.py
-
-# Run a specific test function
-uv run pytest src/exo/shared/tests/test_election.py::test_function_name
-
-# Type checking (strict mode)
-uv run basedpyright
-
-# Linting
-uv run ruff check
-
-# Format code (using nix)
-nix fmt
-```
-
-## Pre-Commit Checks (REQUIRED)
-
-**IMPORTANT: Always run these checks before committing code. CI will fail if these don't pass.**
-
-```bash
-# 1. Type checking - MUST pass with 0 errors
-uv run basedpyright
-
-# 2. Linting - MUST pass
-uv run ruff check
-
-# 3. Formatting - MUST be applied
-nix fmt
-
-# 4. Tests - MUST pass
-uv run pytest
-```
-
-Run all checks in sequence:
-```bash
-uv run basedpyright && uv run ruff check && nix fmt && uv run pytest
-```
-
-If `nix fmt` changes any files, stage them before committing. The CI runs `nix flake check` which verifies formatting, linting, and runs Rust tests.
-
-## Architecture
-
-### Node Composition
-A single exo `Node` (src/exo/main.py) runs multiple components:
-- **Router**: libp2p-based pub/sub messaging via Rust bindings (exo_pyo3_bindings)
-- **Worker**: Handles inference tasks, downloads models, manages runner processes
-- **Master**: Coordinates cluster state, places model instances across nodes
-- **Election**: Bully algorithm for master election
-- **API**: FastAPI server for OpenAI-compatible chat completions
-
-### Message Flow
-Components communicate via typed pub/sub topics (src/exo/routing/topics.py):
-- `GLOBAL_EVENTS`: Master broadcasts indexed events to all workers
-- `LOCAL_EVENTS`: Workers send events to master for indexing
-- `COMMANDS`: Workers/API send commands to master
-- `ELECTION_MESSAGES`: Election protocol messages
-- `CONNECTION_MESSAGES`: libp2p connection updates
-
-### Event Sourcing
-The system uses event sourcing for state management:
-- `State` (src/exo/shared/types/state.py): Immutable state object
-- `apply()` (src/exo/shared/apply.py): Pure function that applies events to state
-- Master indexes events and broadcasts; workers apply indexed events
-
-### Key Type Hierarchy
-- `src/exo/shared/types/`: Pydantic models for all shared types
-  - `events.py`: Event types (discriminated union)
-  - `commands.py`: Command types
-  - `tasks.py`: Task types for worker execution
-  - `state.py`: Cluster state model
-
-### Rust Components
-Rust code in `rust/` provides:
-- `networking`: libp2p networking (gossipsub, peer discovery)
-- `exo_pyo3_bindings`: PyO3 bindings exposing Rust to Python
-- `system_custodian`: System-level operations
-
-### Dashboard
-Svelte 5 + TypeScript frontend in `dashboard/`. Build output goes to `dashboard/build/` and is served by the API.
-
-## Code Style Requirements
-
-From .cursorrules:
-- Strict, exhaustive typing - never bypass the type-checker
-- Use `Literal[...]` for enum-like sets, `typing.NewType` for primitives
-- Pydantic models with `frozen=True` and `strict=True`
-- Pure functions with injectable effect handlers for side-effects
-- Descriptive names - no abbreviations or 3-letter acronyms
-- Catch exceptions only where you can handle them meaningfully
-- Use `@final` and immutability wherever applicable
-
-## Testing
-
-Tests use pytest-asyncio with `asyncio_mode = "auto"`. Tests are in `tests/` subdirectories alongside the code they test. The `EXO_TESTS=1` env var is set during tests.
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1 +0,0 @@
-AGENTS.md
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4340,6 +4340,25 @@ dependencies = [
 "libc",
 ]

+[[package]]
+name = "system_custodian"
+version = "0.0.1"
+dependencies = [
+ "delegate",
+ "derive_more",
+ "either",
+ "extend",
+ "futures",
+ "futures-timer",
+ "impl-trait-for-tuples",
+ "keccak-const",
+ "log",
+ "thiserror 2.0.17",
+ "tokio",
+ "tracing-subscriber",
+ "util",
+]
+
 [[package]]
 name = "tagptr"
 version = "0.2.0"
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -3,6 +3,7 @@ resolver = "3"
 members = [
    "rust/networking",
    "rust/exo_pyo3_bindings",
+    "rust/system_custodian",
    "rust/util",
 ]

@@ -24,6 +25,7 @@ opt-level = 3
 [workspace.dependencies]
 ## Crate members as common dependencies
 networking = { path = "rust/networking" }
+system_custodian = { path = "rust/system_custodian" }
 util = { path = "rust/util" }

 # Proc-macro authoring tools
--- a/MISSED_THINGS.md
+++ b/MISSED_THINGS.md
@@ -1,41 +0,0 @@
-# Missed things
-[X] Log EXO_LIBP2P_NAMESPACE on start in exo/main.py
-[X] Ordering of warmup was changed, which is wrong. It was changed to rank < n-1, then rank=n-1. It should be rank!=0 then rank=0 (this matches the auto_parallel implementation. NOTE: we use a different convention to mlx-lm, our terminal rank is rank=n-1 whereas mlx-lm is rank=0 hence i can see why this was changed wrongly).
-[X] Downloads keying by model_id not shard_metadata (worker/plan.py, worker/main.py).
-[X] Fetching download status of all models on start
-[X] Deduplication of tasks in plan_step.
-[X] resolve_allow_patterns should just be wildcard now.
-[] no mx_barrier in genreate.py mlx_generate at the end.
-[] cache assertion not needed in auto_parallel.py PipelineLastLayer.
-[] GPTOSS support dropped in auto_parallel.py.
-[] sharding changed "all-to-sharded" became _all_to_sharded in auto_parallel.py.
-[] same as above with "sharded-to-all" became _sharded_to_all in auto_parallel.py.
-[] Dropped support for Ministral3Model, DeepseekV32Model, Glm4MoeModel, Qwen3NextModel, GptOssMode in auto_parallel.py.
-[] Dropped prefill/decode code in auto_parallel.py and utils_mlx.py.
-[X] KV_CACHE_BITS should be None to disable quantized KV cache.
-[] Dropped _set_nofile_limit in utils_mlx.py.
-[] We have group optional in load_mlx_items in utils_mlx.py.
-[] Dropped add_missing_chat_templates for GptOss in load_mlx_items in utils_mlx.py.
-[] Dropped model.make_cache in make_kv_cache in utils_mlx.py.
-[X] We put cache limit back in utils_mlx.py.
-[] topology.py remove_node removes the connections after checking if node is is in self._node_id_to_rx_id_map. on beta_1 it checks after, so would remove stale connections I guess?
-[] Missing Glm 4.7 model cards (this isn't ready yet but should be picked up, probably create an issue... the blocker is transforemrs version doesn't support the tokenizer for Glm 4.7. rc-1 does but we can't upgrade as it breaks other things.)
-[] try-except in _command_processor only excepts ValueError. This was silently failing leading to un-debuggable errors (we had a KeyError that was happening ). Changed this to catch Exception instead of ValueError. See exo-v2 89ae38405e0052e3c22405daf094b065878aa873 and fb99fea69b5a39017efc90c5dad0072e677455f0.
-[X] In placement.py, place_instance no longer looks at model_meta.supports_tensor and check if this tensor parallel number of nodes is supported by the model's tensor dimensions.
-[X] In placement.py, place_instanec, we no longer have the special case to exclude DeepSeek v3.1 pipeline parallel (it doesn't work).
-[] logger.warning("You have likely selected ibv for a single node instance; falling back to MlxRing") was changed to debug. That will spam this warning since it happens every time we query instance previews.
-[X] In placement_utils.py, get_mlx_jaccl_coordinators, We no longer prioritise Jaccl Coordinator IP. Now it picks the first one, which is unstable (Jaccl coordinator over TB5 is unstable).
-
-
-
-[X] Downloads keying by model_id not shard_metadata (worker/plan.py, worker/main.py).
-[X] Fetching download status of all models on start
-[X] Deduplication of tasks in plan_step.
-[X] resolve_allow_patterns should just be wildcard now.
-[X] KV_CACHE_BITS should be None to disable quantized KV cache.
-[X] We put cache limit back in utils_mlx.py.
-[X] In placement.py, place_instance no longer looks at model_meta.supports_tensor and check if this tensor parallel number of nodes is supported by the model's tensor dimensions.
-[X] In placement.py, place_instanec, we no longer have the special case to exclude DeepSeek v3.1 pipeline parallel (it doesn't work).
-[X] In placement_utils.py, get_mlx_jaccl_coordinators, We no longer prioritise Jaccl Coordinator IP. Now it picks the first one, which is unstable (Jaccl coordinator over TB5 is unstable).
-
-
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
 exo: Run your own AI cluster at home with everyday devices. Maintained by [exo labs](https://x.com/exolabs).

 <p align="center">
-  <a href="https://discord.gg/TJ4P57arEm" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Discord-Join%20Server-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
+  <a href="https://discord.gg/72NsF6ux" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Discord-Join%20Server-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
  <a href="https://x.com/exolabs" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/twitter/follow/exolabs?style=social" alt="X"></a>
  <a href="https://www.apache.org/licenses/LICENSE-2.0.html" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/License-Apache2.0-blue.svg" alt="License: Apache-2.0"></a>
 </p>
@@ -166,24 +166,6 @@ Download the latest build here: [EXO-latest.dmg](https://assets.exolabs.net/EXO-

 The app will ask for permission to modify system settings and install a new Network profile. Improvements to this are being worked on.

-#### Uninstalling the macOS App
-
-The recommended way to uninstall is through the app itself: click the menu bar icon → Advanced → Uninstall. This cleanly removes all system components.
-
-If you've already deleted the app, you can run the standalone uninstaller script:
-
-```bash
-sudo ./app/EXO/uninstall-exo.sh
-```
-
-This removes:
-- Network setup LaunchDaemon
-- Network configuration script
-- Log files
-- The "exo" network location
-
-**Note:** You'll need to manually remove EXO from Login Items in System Settings → General → Login Items.
-
 ---

 ### Enabling RDMA on macOS
@@ -305,10 +287,7 @@ curl -X DELETE http://localhost:52415/instance/YOUR_INSTANCE_ID
 - List all models: `curl http://localhost:52415/models`
 - Inspect instance IDs and deployment state: `curl http://localhost:52415/state`

-For further details, see:
-
-- API basic documentation in [docs/api.md](docs/api.md).
-- API types and endpoints in [src/exo/master/api.py](src/exo/master/api.py).
+For further details, see API types and endpoints in [src/exo/master/api.py](src/exo/master/api.py).

 ---

--- a/app/EXO/EXO.xcodeproj/project.pbxproj
+++ b/app/EXO/EXO.xcodeproj/project.pbxproj
@@ -585,7 +585,7 @@
 			repositoryURL = "https://github.com/sparkle-project/Sparkle.git";
 			requirement = {
 				kind = upToNextMajorVersion;
-				minimumVersion = 2.9.0-beta.1;
+				minimumVersion = 2.8.1;
 			};
 		};
 /* End XCRemoteSwiftPackageReference section */
--- a/app/EXO/EXO.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved
+++ b/app/EXO/EXO.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved
@@ -6,8 +6,8 @@
      "kind" : "remoteSourceControl",
      "location" : "https://github.com/sparkle-project/Sparkle.git",
      "state" : {
-        "revision" : "e641adb41915a8409895e2e30666aa64e487b637",
-        "version" : "2.9.0-beta.1"
+        "revision" : "5581748cef2bae787496fe6d61139aebe0a451f6",
+        "version" : "2.8.1"
      }
    }
  ],
--- a/app/EXO/EXO/ContentView.swift
+++ b/app/EXO/EXO/ContentView.swift
@@ -12,25 +12,18 @@ struct ContentView: View {
    @EnvironmentObject private var controller: ExoProcessController
    @EnvironmentObject private var stateService: ClusterStateService
    @EnvironmentObject private var networkStatusService: NetworkStatusService
-    @EnvironmentObject private var localNetworkChecker: LocalNetworkChecker
    @EnvironmentObject private var updater: SparkleUpdater
    @State private var focusedNode: NodeViewModel?
    @State private var deletingInstanceIDs: Set<String> = []
    @State private var showAllNodes = false
    @State private var showAllInstances = false
-    @State private var showAdvanced = false
    @State private var showDebugInfo = false
    @State private var bugReportInFlight = false
    @State private var bugReportMessage: String?
-    @State private var uninstallInProgress = false
-    @State private var pendingNamespace: String = ""

    var body: some View {
        VStack(alignment: .leading, spacing: 12) {
            statusSection
-            if shouldShowLocalNetworkWarning {
-                localNetworkWarningBanner
-            }
            if shouldShowClusterDetails {
                Divider()
                overviewSection
@@ -45,7 +38,6 @@ struct ContentView: View {
        }
        .animation(.easeInOut(duration: 0.3), value: shouldShowClusterDetails)
        .animation(.easeInOut(duration: 0.3), value: shouldShowInstances)
-        .animation(.easeInOut(duration: 0.3), value: shouldShowLocalNetworkWarning)
        .padding()
        .frame(width: 340)
        .onAppear {
@@ -55,67 +47,9 @@ struct ContentView: View {
        }
    }

-    private var shouldShowLocalNetworkWarning: Bool {
-        // Show warning if local network is not working and EXO is running.
-        // The checker uses a longer timeout on first launch to allow time for
-        // the permission prompt, so this correctly handles both:
-        // 1. User denied permission on first launch
-        // 2. Permission broke after restart (macOS TCC bug)
-        if case .notWorking = localNetworkChecker.status {
-            return controller.status != .stopped
-        }
-        return false
-    }
-
-    private var localNetworkWarningBanner: some View {
-        VStack(alignment: .leading, spacing: 6) {
-            HStack(spacing: 6) {
-                Image(systemName: "exclamationmark.triangle.fill")
-                    .foregroundColor(.orange)
-                Text("Local Network Access Issue")
-                    .font(.caption)
-                    .fontWeight(.semibold)
-            }
-            Text(
-                "Device discovery won't work. To fix:\n1. Quit EXO\n2. Open System Settings → Privacy & Security → Local Network\n3. Toggle EXO off, then back on\n4. Relaunch EXO"
-            )
-            .font(.caption2)
-            .foregroundColor(.secondary)
-            .fixedSize(horizontal: false, vertical: true)
-            Button {
-                openLocalNetworkSettings()
-            } label: {
-                Text("Open Settings")
-                    .font(.caption2)
-            }
-            .buttonStyle(.bordered)
-            .controlSize(.small)
-        }
-        .padding(8)
-        .background(
-            RoundedRectangle(cornerRadius: 8)
-                .fill(Color.orange.opacity(0.1))
-        )
-        .overlay(
-            RoundedRectangle(cornerRadius: 8)
-                .stroke(Color.orange.opacity(0.3), lineWidth: 1)
-        )
-    }
-
-    private func openLocalNetworkSettings() {
-        // Open Privacy & Security settings - Local Network section
-        if let url = URL(
-            string: "x-apple.systempreferences:com.apple.preference.security?Privacy_LocalNetwork")
-        {
-            NSWorkspace.shared.open(url)
-        }
-    }
-
    private var topologySection: some View {
        Group {
-            if let topology = stateService.latestSnapshot?.topologyViewModel(
-                localNodeId: stateService.localNodeId), !topology.nodes.isEmpty
-            {
+            if let topology = stateService.latestSnapshot?.topologyViewModel(localNodeId: stateService.localNodeId), !topology.nodes.isEmpty {
                TopologyMiniView(topology: topology)
            }
        }
@@ -149,10 +83,8 @@ struct ContentView: View {
                VStack(alignment: .leading, spacing: 4) {
                    HStack {
                        VStack(alignment: .leading) {
-                            Text(
-                                "\(overview.usedRam, specifier: "%.0f") / \(overview.totalRam, specifier: "%.0f") GB"
-                            )
-                            .font(.headline)
+                            Text("\(overview.usedRam, specifier: "%.0f") / \(overview.totalRam, specifier: "%.0f") GB")
+                                .font(.headline)
                            Text("Memory")
                                .font(.caption)
                                .foregroundColor(.secondary)
@@ -261,7 +193,11 @@ struct ContentView: View {
                Divider()
                    .padding(.vertical, 4)
            }
-            advancedSection
+            controlButton(title: "Check for Updates") {
+                updater.checkForUpdates()
+            }
+            .padding(.bottom, 8)
+            debugSection
                .padding(.bottom, 8)
            controlButton(title: "Quit", tint: .secondary) {
                controller.stop()
@@ -270,57 +206,7 @@ struct ContentView: View {
        }
    }

-    private var advancedSection: some View {
-        VStack(alignment: .leading, spacing: 6) {
-            HStack {
-                Text("Advanced")
-                    .font(.caption)
-                    .foregroundColor(.secondary)
-                Spacer()
-                collapseButton(isExpanded: $showAdvanced)
-            }
-            .animation(nil, value: showAdvanced)
-            if showAdvanced {
-                VStack(alignment: .leading, spacing: 8) {
-                    VStack(alignment: .leading, spacing: 4) {
-                        Text("Cluster Namespace")
-                            .font(.caption2)
-                            .foregroundColor(.secondary)
-                        HStack {
-                            TextField("optional", text: $pendingNamespace)
-                                .textFieldStyle(.roundedBorder)
-                                .font(.caption2)
-                                .onAppear {
-                                    pendingNamespace = controller.customNamespace
-                                }
-                            Button("Save & Restart") {
-                                controller.customNamespace = pendingNamespace
-                                if controller.status == .running || controller.status == .starting {
-                                    controller.restart()
-                                }
-                            }
-                            .font(.caption2)
-                            .disabled(pendingNamespace == controller.customNamespace)
-                        }
-                    }
-                    HoverButton(title: "Check for Updates", small: true) {
-                        updater.checkForUpdates()
-                    }
-                    debugSection
-                    HoverButton(title: "Uninstall", tint: .red, small: true) {
-                        showUninstallConfirmationAlert()
-                    }
-                    .disabled(uninstallInProgress)
-                }
-                .transition(.opacity)
-            }
-        }
-        .animation(.easeInOut(duration: 0.25), value: showAdvanced)
-    }
-
-    private func controlButton(title: String, tint: Color = .primary, action: @escaping () -> Void)
-        -> some View
-    {
+    private func controlButton(title: String, tint: Color = .primary, action: @escaping () -> Void) -> some View {
        HoverButton(title: title, tint: tint, trailingSystemImage: nil, action: action)
    }

@@ -351,12 +237,9 @@ struct ContentView: View {
        Button {
            isExpanded.wrappedValue.toggle()
        } label: {
-            Label(
-                isExpanded.wrappedValue ? "Hide" : "Show All",
-                systemImage: isExpanded.wrappedValue ? "chevron.up" : "chevron.down"
-            )
-            .labelStyle(.titleAndIcon)
-            .contentTransition(.symbolEffect(.replace))
+            Label(isExpanded.wrappedValue ? "Hide" : "Show All", systemImage: isExpanded.wrappedValue ? "chevron.up" : "chevron.down")
+                .labelStyle(.titleAndIcon)
+                .contentTransition(.symbolEffect(.replace))
        }
        .buttonStyle(.plain)
        .font(.caption2)
@@ -445,15 +328,15 @@ struct ContentView: View {
    }

    private var debugSection: some View {
-        VStack(alignment: .leading, spacing: 4) {
-            HoverButton(
-                title: "Debug Info",
-                tint: .primary,
-                trailingSystemImage: showDebugInfo ? "chevron.up" : "chevron.down",
-                small: true
-            ) {
-                showDebugInfo.toggle()
+        VStack(alignment: .leading, spacing: 6) {
+            HStack {
+                Text("Debug Info")
+                    .font(.caption)
+                    .foregroundColor(.secondary)
+                Spacer()
+                collapseButton(isExpanded: $showDebugInfo)
            }
+            .animation(nil, value: showDebugInfo)
            if showDebugInfo {
                VStack(alignment: .leading, spacing: 4) {
                    Text("Version: \(buildTag)")
@@ -466,63 +349,15 @@ struct ContentView: View {
                        .font(.caption2)
                        .foregroundColor(thunderboltStatusColor)
                    interfaceIpList
-                    rdmaStatusView
                    sendBugReportButton
                        .padding(.top, 6)
                }
-                .padding(.leading, 8)
                .transition(.opacity)
            }
        }
        .animation(.easeInOut(duration: 0.25), value: showDebugInfo)
    }

-    private var rdmaStatusView: some View {
-        let rdma = networkStatusService.status.rdmaStatus
-        return VStack(alignment: .leading, spacing: 1) {
-            Text("RDMA: \(rdmaStatusText(rdma))")
-                .font(.caption2)
-                .foregroundColor(rdmaStatusColor(rdma))
-            if !rdma.devices.isEmpty {
-                Text("  Devices: \(rdma.devices.joined(separator: ", "))")
-                    .font(.caption2)
-                    .foregroundColor(.secondary)
-            }
-            if !rdma.activePorts.isEmpty {
-                Text("  Active Ports:")
-                    .font(.caption2)
-                    .foregroundColor(.secondary)
-                ForEach(rdma.activePorts, id: \.device) { port in
-                    Text("    \(port.device) port \(port.port): \(port.state)")
-                        .font(.caption2)
-                        .foregroundColor(.green)
-                }
-            }
-        }
-    }
-
-    private func rdmaStatusText(_ rdma: RDMAStatus) -> String {
-        switch rdma.rdmaCtlEnabled {
-        case .some(true):
-            return "Enabled"
-        case .some(false):
-            return "Disabled"
-        case nil:
-            return rdma.devices.isEmpty ? "Not Available" : "Available"
-        }
-    }
-
-    private func rdmaStatusColor(_ rdma: RDMAStatus) -> Color {
-        switch rdma.rdmaCtlEnabled {
-        case .some(true):
-            return .green
-        case .some(false):
-            return .orange
-        case nil:
-            return rdma.devices.isEmpty ? .secondary : .green
-        }
-    }
-
    private var sendBugReportButton: some View {
        VStack(alignment: .leading, spacing: 4) {
            Button {
@@ -612,88 +447,6 @@ struct ContentView: View {
        bugReportInFlight = false
    }

-    private func showUninstallConfirmationAlert() {
-        let alert = NSAlert()
-        alert.messageText = "Uninstall EXO"
-        alert.informativeText = """
-            This will remove EXO and all its system components:
-
-            • Network configuration daemon
-            • Launch at login registration
-            • EXO network location
-
-            The app will be moved to Trash.
-            """
-        alert.alertStyle = .warning
-        alert.addButton(withTitle: "Uninstall")
-        alert.addButton(withTitle: "Cancel")
-
-        // Style the Uninstall button as destructive
-        if let uninstallButton = alert.buttons.first {
-            uninstallButton.hasDestructiveAction = true
-        }
-
-        let response = alert.runModal()
-        if response == .alertFirstButtonReturn {
-            performUninstall()
-        }
-    }
-
-    private func performUninstall() {
-        uninstallInProgress = true
-
-        // Stop EXO process first
-        controller.cancelPendingLaunch()
-        controller.stop()
-        stateService.stopPolling()
-
-        // Run the privileged uninstall on a background thread
-        // Using .utility QoS to avoid priority inversion with NSAppleScript's subprocess
-        DispatchQueue.global(qos: .utility).async {
-            do {
-                // Remove network setup daemon and components (requires admin privileges)
-                try NetworkSetupHelper.uninstall()
-
-                DispatchQueue.main.async {
-                    // Unregister from launch at login
-                    LaunchAtLoginHelper.disable()
-
-                    // Move app to trash
-                    self.moveAppToTrash()
-
-                    // Quit the app
-                    DispatchQueue.main.asyncAfter(deadline: .now() + 0.5) {
-                        NSApplication.shared.terminate(nil)
-                    }
-                }
-            } catch {
-                DispatchQueue.main.async {
-                    self.showErrorAlert(message: error.localizedDescription)
-                    self.uninstallInProgress = false
-                }
-            }
-        }
-    }
-
-    private func showErrorAlert(message: String) {
-        let alert = NSAlert()
-        alert.messageText = "Uninstall Failed"
-        alert.informativeText = message
-        alert.alertStyle = .critical
-        alert.addButton(withTitle: "OK")
-        alert.runModal()
-    }
-
-    private func moveAppToTrash() {
-        guard let appURL = Bundle.main.bundleURL as URL? else { return }
-        do {
-            try FileManager.default.trashItem(at: appURL, resultingItemURL: nil)
-        } catch {
-            // If we can't trash the app, that's OK - user can do it manually
-            // The important system components have already been cleaned up
-        }
-    }
-
    private var buildTag: String {
        Bundle.main.infoDictionary?["EXOBuildTag"] as? String ?? "unknown"
    }
@@ -707,27 +460,14 @@ private struct HoverButton: View {
    let title: String
    let tint: Color
    let trailingSystemImage: String?
-    let small: Bool
    let action: () -> Void

-    init(
-        title: String, tint: Color = .primary, trailingSystemImage: String? = nil,
-        small: Bool = false, action: @escaping () -> Void
-    ) {
-        self.title = title
-        self.tint = tint
-        self.trailingSystemImage = trailingSystemImage
-        self.small = small
-        self.action = action
-    }
-
    @State private var isHovering = false

    var body: some View {
        Button(action: action) {
            HStack {
                Text(title)
-                    .font(small ? .caption : nil)
                Spacer()
                if let systemName = trailingSystemImage {
                    Image(systemName: systemName)
@@ -735,8 +475,8 @@ private struct HoverButton: View {
                }
            }
            .frame(maxWidth: .infinity, alignment: .leading)
-            .padding(.vertical, small ? 4 : 6)
-            .padding(.horizontal, small ? 6 : 8)
+            .padding(.vertical, 6)
+            .padding(.horizontal, 8)
            .background(
                RoundedRectangle(cornerRadius: 6)
                    .fill(
@@ -751,3 +491,4 @@ private struct HoverButton: View {
        .onHover { isHovering = $0 }
    }
 }
+
--- a/app/EXO/EXO/EXOApp.swift
+++ b/app/EXO/EXO/EXOApp.swift
@@ -8,9 +8,9 @@
 import AppKit
 import CoreImage
 import CoreImage.CIFilterBuiltins
-import ServiceManagement
 import Sparkle
 import SwiftUI
+import ServiceManagement
 import UserNotifications
 import os.log

@@ -19,7 +19,6 @@ struct EXOApp: App {
    @StateObject private var controller: ExoProcessController
    @StateObject private var stateService: ClusterStateService
    @StateObject private var networkStatusService: NetworkStatusService
-    @StateObject private var localNetworkChecker: LocalNetworkChecker
    @StateObject private var updater: SparkleUpdater
    private let terminationObserver: TerminationObserver
    private let ciContext = CIContext(options: nil)
@@ -38,13 +37,9 @@ struct EXOApp: App {
        _stateService = StateObject(wrappedValue: service)
        let networkStatus = NetworkStatusService()
        _networkStatusService = StateObject(wrappedValue: networkStatus)
-        let localNetwork = LocalNetworkChecker()
-        _localNetworkChecker = StateObject(wrappedValue: localNetwork)
        _updater = StateObject(wrappedValue: updater)
        enableLaunchAtLoginIfNeeded()
        NetworkSetupHelper.ensureLaunchDaemonInstalled()
-        // Check local network access BEFORE launching exo
-        localNetwork.check()
        controller.scheduleLaunch(after: 15)
        service.startPolling()
        networkStatus.startPolling()
@@ -56,7 +51,6 @@ struct EXOApp: App {
                .environmentObject(controller)
                .environmentObject(stateService)
                .environmentObject(networkStatusService)
-                .environmentObject(localNetworkChecker)
                .environmentObject(updater)
        } label: {
            menuBarIcon
@@ -113,7 +107,7 @@ struct EXOApp: App {
        filter.contrast = 0.9

        guard let output = filter.outputImage,
-            let rendered = ciContext.createCGImage(output, from: output.extent)
+              let rendered = ciContext.createCGImage(output, from: output.extent)
        else {
            return nil
        }
@@ -126,26 +120,7 @@ struct EXOApp: App {
        do {
            try SMAppService.mainApp.register()
        } catch {
-            Logger().error(
-                "Failed to register EXO for launch at login: \(error.localizedDescription)")
-        }
-    }
-}
-
-/// Helper for managing EXO's launch-at-login registration
-enum LaunchAtLoginHelper {
-    private static let logger = Logger(subsystem: "io.exo.EXO", category: "LaunchAtLogin")
-
-    /// Unregisters EXO from launching at login
-    static func disable() {
-        guard SMAppService.mainApp.status == .enabled else { return }
-        do {
-            try SMAppService.mainApp.unregister()
-            logger.info("Unregistered EXO from launch at login")
-        } catch {
-            logger.error(
-                "Failed to unregister EXO from launch at login: \(error.localizedDescription, privacy: .public)"
-            )
+            Logger().error("Failed to register EXO for launch at login: \(error.localizedDescription)")
        }
    }
 }
@@ -170,7 +145,7 @@ final class SparkleUpdater: NSObject, ObservableObject {
        center.requestAuthorization(options: [.alert, .sound]) { _, _ in }
        controller.updater.automaticallyChecksForUpdates = true
        controller.updater.automaticallyDownloadsUpdates = false
-        controller.updater.updateCheckInterval = 900  // 15 minutes
+        controller.updater.updateCheckInterval = 900 // 15 minutes
        DispatchQueue.main.asyncAfter(deadline: .now() + 5) { [weak controller] in
            controller?.updater.checkForUpdatesInBackground()
        }
@@ -237,8 +212,7 @@ private final class ExoNotificationDelegate: NSObject, UNUserNotificationCenterD
    func userNotificationCenter(
        _ center: UNUserNotificationCenter,
        willPresent notification: UNNotification,
-        withCompletionHandler completionHandler: @escaping (UNNotificationPresentationOptions) ->
-            Void
+        withCompletionHandler completionHandler: @escaping (UNNotificationPresentationOptions) -> Void
    ) {
        completionHandler([.banner, .list, .sound])
    }
--- a/app/EXO/EXO/ExoProcessController.swift
+++ b/app/EXO/EXO/ExoProcessController.swift
@@ -2,8 +2,6 @@ import AppKit
 import Combine
 import Foundation

-private let customNamespaceKey = "EXOCustomNamespace"
-
 @MainActor
 final class ExoProcessController: ObservableObject {
    enum Status: Equatable {
@@ -29,14 +27,6 @@ final class ExoProcessController: ObservableObject {
    @Published private(set) var status: Status = .stopped
    @Published private(set) var lastError: String?
    @Published private(set) var launchCountdownSeconds: Int?
-    @Published var customNamespace: String = {
-        return UserDefaults.standard.string(forKey: customNamespaceKey) ?? ""
-    }()
-    {
-        didSet {
-            UserDefaults.standard.set(customNamespace, forKey: customNamespaceKey)
-        }
-    }

    private var process: Process?
    private var runtimeDirectoryURL: URL?
@@ -190,7 +180,7 @@ final class ExoProcessController: ObservableObject {
    private func makeEnvironment(for runtimeURL: URL) -> [String: String] {
        var environment = ProcessInfo.processInfo.environment
        environment["EXO_RUNTIME_DIR"] = runtimeURL.path
-        environment["EXO_LIBP2P_NAMESPACE"] = computeNamespace()
+        environment["EXO_LIBP2P_NAMESPACE"] = buildTag()

        var paths: [String] = []
        if let existing = environment["PATH"], !existing.isEmpty {
@@ -222,19 +212,11 @@ final class ExoProcessController: ObservableObject {
        if let tag = Bundle.main.infoDictionary?["EXOBuildTag"] as? String, !tag.isEmpty {
            return tag
        }
-        if let short = Bundle.main.infoDictionary?["CFBundleShortVersionString"] as? String,
-            !short.isEmpty
-        {
+        if let short = Bundle.main.infoDictionary?["CFBundleShortVersionString"] as? String, !short.isEmpty {
            return short
        }
        return "dev"
    }
-
-    private func computeNamespace() -> String {
-        let base = buildTag()
-        let custom = customNamespace.trimmingCharacters(in: .whitespaces)
-        return custom.isEmpty ? base : custom
-    }
 }

 struct RuntimeError: LocalizedError {
--- a/app/EXO/EXO/Info.plist
+++ b/app/EXO/EXO/Info.plist
@@ -8,15 +8,5 @@
 	<string>$(EXO_BUILD_TAG)</string>
 	<key>EXOBuildCommit</key>
 	<string>$(EXO_BUILD_COMMIT)</string>
-	<key>EXOBugReportPresignedUrlEndpoint</key>
-	<string>$(EXO_BUG_REPORT_PRESIGNED_URL_ENDPOINT)</string>
-	<key>NSLocalNetworkUsageDescription</key>
-	<string>EXO needs local network access to discover and connect to other devices in your cluster for distributed AI inference.</string>
-	<key>NSBonjourServices</key>
-	<array>
-		<string>_p2p._tcp</string>
-		<string>_p2p._udp</string>
-		<string>_libp2p._udp</string>
-	</array>
 </dict>
 </plist>
--- a/app/EXO/EXO/Models/ClusterState.swift
+++ b/app/EXO/EXO/Models/ClusterState.swift
@@ -16,13 +16,10 @@ struct ClusterState: Decodable {
        self.instances = rawInstances.mapValues(\.instance)
        self.runners = try container.decode([String: RunnerStatusSummary].self, forKey: .runners)
        self.nodeProfiles = try container.decode([String: NodeProfile].self, forKey: .nodeProfiles)
-        let rawTasks =
-            try container.decodeIfPresent([String: TaggedTask].self, forKey: .tasks) ?? [:]
+        let rawTasks = try container.decodeIfPresent([String: TaggedTask].self, forKey: .tasks) ?? [:]
        self.tasks = rawTasks.compactMapValues(\.task)
        self.topology = try container.decodeIfPresent(Topology.self, forKey: .topology)
-        let rawDownloads =
-            try container.decodeIfPresent([String: [TaggedNodeDownload]].self, forKey: .downloads)
-            ?? [:]
+        let rawDownloads = try container.decodeIfPresent([String: [TaggedNodeDownload]].self, forKey: .downloads) ?? [:]
        self.downloads = rawDownloads.mapValues { $0.compactMap(\.status) }
    }

@@ -44,8 +41,7 @@ private struct TaggedInstance: Decodable {
        let payloads = try container.decode([String: ClusterInstancePayload].self)
        guard let entry = payloads.first else {
            throw DecodingError.dataCorrupted(
-                DecodingError.Context(
-                    codingPath: decoder.codingPath, debugDescription: "Empty instance payload")
+                DecodingError.Context(codingPath: decoder.codingPath, debugDescription: "Empty instance payload")
            )
        }
        self.instance = ClusterInstance(
@@ -81,8 +77,7 @@ struct RunnerStatusSummary: Decodable {
        let payloads = try container.decode([String: RunnerStatusDetail].self)
        guard let entry = payloads.first else {
            throw DecodingError.dataCorrupted(
-                DecodingError.Context(
-                    codingPath: decoder.codingPath, debugDescription: "Empty runner status payload")
+                DecodingError.Context(codingPath: decoder.codingPath, debugDescription: "Empty runner status payload")
            )
        }
        self.status = entry.key
@@ -262,9 +257,7 @@ struct ChatCompletionTaskParameters: Decodable, Equatable {

    func promptPreview() -> String? {
        guard let messages else { return nil }
-        if let userMessage = messages.last(where: {
-            $0.role?.lowercased() == "user" && ($0.content?.isEmpty == false)
-        }) {
+        if let userMessage = messages.last(where: { $0.role?.lowercased() == "user" && ($0.content?.isEmpty == false) }) {
            return userMessage.content
        }
        return messages.last?.content
@@ -372,3 +365,5 @@ extension ClusterState {

    func availableModels() -> [ModelOption] { [] }
 }
+
+
--- a/app/EXO/EXO/Services/BugReportService.swift
+++ b/app/EXO/EXO/Services/BugReportService.swift
@@ -1,3 +1,4 @@
+import CryptoKit
 import Foundation

 struct BugReportOutcome: Equatable {
@@ -6,17 +7,17 @@ struct BugReportOutcome: Equatable {
 }

 enum BugReportError: LocalizedError {
+    case missingCredentials
    case invalidEndpoint
-    case presignedUrlFailed(String)
    case uploadFailed(String)
    case collectFailed(String)

    var errorDescription: String? {
        switch self {
+        case .missingCredentials:
+            return "Bug report upload credentials are not set."
        case .invalidEndpoint:
            return "Bug report endpoint is invalid."
-        case .presignedUrlFailed(let message):
-            return "Failed to get presigned URLs: \(message)"
        case .uploadFailed(let message):
            return "Bug report upload failed: \(message)"
        case .collectFailed(let message):
@@ -26,13 +27,11 @@ enum BugReportError: LocalizedError {
 }

 struct BugReportService {
-    private struct PresignedUrlsRequest: Codable {
-        let keys: [String]
-    }
-
-    private struct PresignedUrlsResponse: Codable {
-        let urls: [String: String]
-        let expiresIn: Int?
+    struct AWSConfig {
+        let accessKey: String
+        let secretKey: String
+        let region: String
+        let bucket: String
    }

    func sendReport(
@@ -40,9 +39,9 @@ struct BugReportService {
        now: Date = Date(),
        isManual: Bool = false
    ) async throws -> BugReportOutcome {
-        let timestamp = Self.runTimestampString(now)
-        let dayPrefix = Self.dayPrefixString(now)
-        let prefix = "reports/\(dayPrefix)/\(timestamp)/"
+        let credentials = try loadCredentials()
+        let timestamp = ISO8601DateFormatter().string(from: now)
+        let prefix = "reports/\(timestamp)/"

        let logData = readLog()
        let ifconfigText = try await captureIfconfig()
@@ -67,82 +66,28 @@ struct BugReportService {
            ("\(prefix)exo.log", logData),
            ("\(prefix)state.json", stateData),
            ("\(prefix)events.json", eventsData),
-            ("\(prefix)report.json", reportJSON),
+            ("\(prefix)report.json", reportJSON)
        ]

-        let uploadItems: [(key: String, body: Data)] = uploads.compactMap { item in
-            guard let body = item.data else { return nil }
-            return (key: item.path, body: body)
+        let uploader = try S3Uploader(config: credentials)
+        for item in uploads {
+            guard let data = item.data else { continue }
+            try await uploader.upload(
+                objectPath: item.path,
+                body: data
+            )
        }

-        guard !uploadItems.isEmpty else {
-            return BugReportOutcome(success: false, message: "No data to upload")
-        }
-
-        let presignedUrls = try await fetchPresignedUploadUrls(keys: uploadItems.map(\.key))
-        for item in uploadItems {
-            guard let urlString = presignedUrls[item.key], let url = URL(string: urlString) else {
-                throw BugReportError.uploadFailed("Missing presigned URL for \(item.key)")
-            }
-            try await uploadToPresignedUrl(url: url, body: item.body)
-        }
-
-        return BugReportOutcome(
-            success: true, message: "Bug Report sent. Thank you for helping to improve EXO 1.0.")
+        return BugReportOutcome(success: true, message: "Bug Report sent. Thank you for helping to improve EXO 1.0.")
    }

-    private static func dayPrefixString(_ date: Date) -> String {
-        var calendar = Calendar(identifier: .gregorian)
-        calendar.timeZone = TimeZone(secondsFromGMT: 0) ?? .current
-        let components = calendar.dateComponents([.year, .month, .day], from: date)
-        let year = components.year ?? 0
-        let month = components.month ?? 0
-        let day = components.day ?? 0
-        return String(format: "%04d/%02d/%02d", year, month, day)
-    }
-
-    private static func runTimestampString(_ date: Date) -> String {
-        let formatter = DateFormatter()
-        formatter.locale = Locale(identifier: "en_US_POSIX")
-        formatter.timeZone = TimeZone(secondsFromGMT: 0) ?? .current
-        formatter.dateFormat = "yyyy-MM-dd'T'HHmmss.SSS'Z'"
-        return formatter.string(from: date)
-    }
-
-    private func fetchPresignedUploadUrls(keys: [String], bundle: Bundle = .main) async throws
-        -> [String: String]
-    {
-        guard
-            let endpointString = bundle.infoDictionary?["EXOBugReportPresignedUrlEndpoint"]
-                as? String
-        else {
-            throw BugReportError.invalidEndpoint
-        }
-        let trimmedEndpointString = endpointString.trimmingCharacters(in: .whitespacesAndNewlines)
-        guard !trimmedEndpointString.isEmpty, let endpoint = URL(string: trimmedEndpointString)
-        else {
-            throw BugReportError.invalidEndpoint
-        }
-
-        var request = URLRequest(url: endpoint)
-        request.httpMethod = "POST"
-        request.timeoutInterval = 10
-        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
-
-        let encoder = JSONEncoder()
-        request.httpBody = try encoder.encode(PresignedUrlsRequest(keys: keys))
-
-        let (data, response) = try await URLSession.shared.data(for: request)
-        guard let http = response as? HTTPURLResponse else {
-            throw BugReportError.presignedUrlFailed("Non-HTTP response")
-        }
-        guard (200..<300).contains(http.statusCode) else {
-            throw BugReportError.presignedUrlFailed("HTTP status \(http.statusCode)")
-        }
-
-        let decoder = JSONDecoder()
-        let decoded = try decoder.decode(PresignedUrlsResponse.self, from: data)
-        return decoded.urls
+    private func loadCredentials() throws -> AWSConfig {
+        return AWSConfig(
+            accessKey: "<redacted>",
+            secretKey: "<redacted>",
+            region: "us-east-1",
+            bucket: "exo-bug-reports"
+        )
    }

    private func readLog() -> Data? {
@@ -155,8 +100,7 @@ struct BugReportService {
    private func captureIfconfig() async throws -> String {
        let result = runCommand(["/sbin/ifconfig"])
        guard result.exitCode == 0 else {
-            throw BugReportError.collectFailed(
-                result.error.isEmpty ? "ifconfig failed" : result.error)
+            throw BugReportError.collectFailed(result.error.isEmpty ? "ifconfig failed" : result.error)
        }
        return result.output
    }
@@ -164,23 +108,12 @@ struct BugReportService {
    private func readDebugInfo() -> DebugInfo {
        DebugInfo(
            thunderboltBridgeDisabled: readThunderboltBridgeDisabled(),
-            interfaces: readInterfaces(),
-            rdma: readRDMADebugInfo()
-        )
-    }
-
-    private func readRDMADebugInfo() -> DebugInfo.RDMADebugInfo {
-        DebugInfo.RDMADebugInfo(
-            rdmaCtlStatus: safeRunCommand(["/usr/bin/rdma_ctl", "status"]),
-            ibvDevices: safeRunCommand(["/usr/bin/ibv_devices"]),
-            ibvDevinfo: safeRunCommand(["/usr/bin/ibv_devinfo"])
+            interfaces: readInterfaces()
        )
    }

    private func readThunderboltBridgeDisabled() -> Bool? {
-        let result = runCommand([
-            "/usr/sbin/networksetup", "-getnetworkserviceenabled", "Thunderbolt Bridge",
-        ])
+        let result = runCommand(["/usr/sbin/networksetup", "-getnetworkserviceenabled", "Thunderbolt Bridge"])
        guard result.exitCode == 0 else { return nil }
        let output = result.output.lowercased()
        if output.contains("enabled") {
@@ -223,8 +156,7 @@ struct BugReportService {
        request.timeoutInterval = 5
        do {
            let (data, response) = try await URLSession.shared.data(for: request)
-            guard let http = response as? HTTPURLResponse, (200..<300).contains(http.statusCode)
-            else {
+            guard let http = response as? HTTPURLResponse, (200..<300).contains(http.statusCode) else {
                return nil
            }
            return data
@@ -233,36 +165,6 @@ struct BugReportService {
        }
    }

-    private func uploadToPresignedUrl(url: URL, body: Data) async throws {
-        let maxAttempts = 2
-        var lastError: Error?
-
-        for attempt in 1...maxAttempts {
-            do {
-                var request = URLRequest(url: url)
-                request.httpMethod = "PUT"
-                request.httpBody = body
-                request.timeoutInterval = 30
-
-                let (_, response) = try await URLSession.shared.data(for: request)
-                guard let http = response as? HTTPURLResponse else {
-                    throw BugReportError.uploadFailed("Non-HTTP response")
-                }
-                guard (200..<300).contains(http.statusCode) else {
-                    throw BugReportError.uploadFailed("HTTP status \(http.statusCode)")
-                }
-                return
-            } catch {
-                lastError = error
-                if attempt < maxAttempts {
-                    try await Task.sleep(nanoseconds: 400_000_000)
-                }
-            }
-        }
-
-        throw BugReportError.uploadFailed(lastError?.localizedDescription ?? "Unknown error")
-    }
-
    private func makeReportJson(
        timestamp: String,
        hostName: String,
@@ -280,7 +182,7 @@ struct BugReportService {
            "system": system,
            "exo_version": exo.version as Any,
            "exo_commit": exo.commit as Any,
-            "report_type": isManual ? "manual" : "automated",
+            "report_type": isManual ? "manual" : "automated"
        ]
        return try? JSONSerialization.data(withJSONObject: payload, options: [.prettyPrinted])
    }
@@ -311,13 +213,10 @@ struct BugReportService {
        let user = safeRunCommand(["/usr/bin/whoami"])
        let consoleUser = safeRunCommand(["/usr/bin/stat", "-f%Su", "/dev/console"])
        let uptime = safeRunCommand(["/usr/bin/uptime"])
-        let diskRoot = safeRunCommand([
-            "/bin/sh", "-c", "/bin/df -h / | awk 'NR==2 {print $1, $2, $3, $4, $5}'",
-        ])
+        let diskRoot = safeRunCommand(["/bin/sh", "-c", "/bin/df -h / | awk 'NR==2 {print $1, $2, $3, $4, $5}'"])

        let interfacesList = safeRunCommand(["/usr/sbin/ipconfig", "getiflist"])
-        let interfacesAndIPs =
-            interfacesList?
+        let interfacesAndIPs = interfacesList?
            .split(whereSeparator: { $0 == " " || $0 == "\n" })
            .compactMap { iface -> [String: Any]? in
                let name = String(iface)
@@ -328,8 +227,7 @@ struct BugReportService {
            } ?? []

        let wifiSSID: String?
-        let airportPath =
-            "/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport"
+        let airportPath = "/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport"
        if FileManager.default.isExecutableFile(atPath: airportPath) {
            wifiSSID = safeRunCommand([airportPath, "-I"]).flatMap(parseWifiSSID)
        } else {
@@ -357,7 +255,7 @@ struct BugReportService {
            "disk_root": diskRoot as Any,
            "interfaces_and_ips": interfacesAndIPs,
            "ipconfig_getiflist": interfacesList as Any,
-            "wifi_ssid": wifiSSID as Any,
+            "wifi_ssid": wifiSSID as Any
        ]
    }

@@ -415,8 +313,7 @@ struct BugReportService {
        for line in airportOutput.split(separator: "\n") {
            let trimmed = line.trimmingCharacters(in: .whitespaces)
            if trimmed.hasPrefix("SSID:") {
-                return trimmed.replacingOccurrences(of: "SSID:", with: "").trimmingCharacters(
-                    in: .whitespaces)
+                return trimmed.replacingOccurrences(of: "SSID:", with: "").trimmingCharacters(in: .whitespaces)
            }
        }
        return nil
@@ -453,7 +350,6 @@ struct BugReportService {
 private struct DebugInfo {
    let thunderboltBridgeDisabled: Bool?
    let interfaces: [InterfaceStatus]
-    let rdma: RDMADebugInfo

    struct InterfaceStatus {
        let name: String
@@ -462,21 +358,7 @@ private struct DebugInfo {
        func toDictionary() -> [String: Any] {
            [
                "name": name,
-                "ip": ip as Any,
-            ]
-        }
-    }
-
-    struct RDMADebugInfo {
-        let rdmaCtlStatus: String?
-        let ibvDevices: String?
-        let ibvDevinfo: String?
-
-        func toDictionary() -> [String: Any] {
-            [
-                "rdma_ctl_status": rdmaCtlStatus as Any,
-                "ibv_devices": ibvDevices as Any,
-                "ibv_devinfo": ibvDevinfo as Any,
+                "ip": ip as Any
            ]
        }
    }
@@ -484,8 +366,7 @@ private struct DebugInfo {
    func toDictionary() -> [String: Any] {
        [
            "thunderbolt_bridge_disabled": thunderboltBridgeDisabled as Any,
-            "interfaces": interfaces.map { $0.toDictionary() },
-            "rdma": rdma.toDictionary(),
+            "interfaces": interfaces.map { $0.toDictionary() }
        ]
    }
 }
@@ -495,3 +376,163 @@ private struct CommandResult {
    let output: String
    let error: String
 }
+
+private struct S3Uploader {
+    let config: BugReportService.AWSConfig
+
+    init(config: BugReportService.AWSConfig) throws {
+        self.config = config
+    }
+
+    func upload(objectPath: String, body: Data) async throws {
+        let host = "\(config.bucket).s3.amazonaws.com"
+        guard let url = URL(string: "https://\(host)/\(objectPath)") else {
+            throw BugReportError.invalidEndpoint
+        }
+
+        let now = Date()
+        let amzDate = awsTimestamp(now)
+        let dateStamp = dateStamp(now)
+        let payloadHash = sha256Hex(body)
+
+        let headers = [
+            "host": host,
+            "x-amz-content-sha256": payloadHash,
+            "x-amz-date": amzDate
+        ]
+
+        let canonicalRequest = buildCanonicalRequest(
+            method: "PUT",
+            url: url,
+            headers: headers,
+            payloadHash: payloadHash
+        )
+
+        let stringToSign = buildStringToSign(
+            amzDate: amzDate,
+            dateStamp: dateStamp,
+            canonicalRequestHash: sha256Hex(canonicalRequest.data(using: .utf8) ?? Data())
+        )
+
+        let signingKey = deriveKey(secret: config.secretKey, dateStamp: dateStamp, region: config.region, service: "s3")
+        let signature = hmacHex(key: signingKey, data: Data(stringToSign.utf8))
+
+        let signedHeaders = "host;x-amz-content-sha256;x-amz-date"
+        let authorization = """
+AWS4-HMAC-SHA256 Credential=\(config.accessKey)/\(dateStamp)/\(config.region)/s3/aws4_request, SignedHeaders=\(signedHeaders), Signature=\(signature)
+"""
+
+        var request = URLRequest(url: url)
+        request.httpMethod = "PUT"
+        request.httpBody = body
+        request.setValue(headers["x-amz-content-sha256"], forHTTPHeaderField: "x-amz-content-sha256")
+        request.setValue(headers["x-amz-date"], forHTTPHeaderField: "x-amz-date")
+        request.setValue(host, forHTTPHeaderField: "Host")
+        request.setValue(authorization, forHTTPHeaderField: "Authorization")
+
+        let (data, response) = try await URLSession.shared.data(for: request)
+        guard let http = response as? HTTPURLResponse, (200..<300).contains(http.statusCode) else {
+            let statusText = (response as? HTTPURLResponse)?.statusCode ?? -1
+            _ = data // ignore response body for UX
+            throw BugReportError.uploadFailed("HTTP status \(statusText)")
+        }
+    }
+
+    private func buildCanonicalRequest(
+        method: String,
+        url: URL,
+        headers: [String: String],
+        payloadHash: String
+    ) -> String {
+        let canonicalURI = encodePath(url.path)
+        let canonicalQuery = url.query ?? ""
+        let sortedHeaders = headers.sorted { $0.key < $1.key }
+        let canonicalHeaders = sortedHeaders
+            .map { "\($0.key.lowercased()):\($0.value)\n" }
+            .joined()
+        let signedHeaders = sortedHeaders.map { $0.key.lowercased() }.joined(separator: ";")
+
+        return [
+            method,
+            canonicalURI,
+            canonicalQuery,
+            canonicalHeaders,
+            signedHeaders,
+            payloadHash
+        ].joined(separator: "\n")
+    }
+
+    private func encodePath(_ path: String) -> String {
+        return path
+            .split(separator: "/")
+            .map { segment in
+                segment.addingPercentEncoding(withAllowedCharacters: Self.rfc3986) ?? String(segment)
+            }
+            .joined(separator: "/")
+            .prependSlashIfNeeded()
+    }
+
+    private func buildStringToSign(
+        amzDate: String,
+        dateStamp: String,
+        canonicalRequestHash: String
+    ) -> String {
+        """
+AWS4-HMAC-SHA256
+\(amzDate)
+\(dateStamp)/\(config.region)/s3/aws4_request
+\(canonicalRequestHash)
+"""
+    }
+
+    private func deriveKey(secret: String, dateStamp: String, region: String, service: String) -> Data {
+        let kDate = hmac(key: Data(("AWS4" + secret).utf8), data: Data(dateStamp.utf8))
+        let kRegion = hmac(key: kDate, data: Data(region.utf8))
+        let kService = hmac(key: kRegion, data: Data(service.utf8))
+        return hmac(key: kService, data: Data("aws4_request".utf8))
+    }
+
+    private func hmac(key: Data, data: Data) -> Data {
+        let keySym = SymmetricKey(data: key)
+        let mac = HMAC<SHA256>.authenticationCode(for: data, using: keySym)
+        return Data(mac)
+    }
+
+    private func hmacHex(key: Data, data: Data) -> String {
+        hmac(key: key, data: data).map { String(format: "%02x", $0) }.joined()
+    }
+
+    private func sha256Hex(_ data: Data) -> String {
+        let digest = SHA256.hash(data: data)
+        return digest.compactMap { String(format: "%02x", $0) }.joined()
+    }
+
+    private func awsTimestamp(_ date: Date) -> String {
+        let formatter = DateFormatter()
+        formatter.dateFormat = "yyyyMMdd'T'HHmmss'Z'"
+        formatter.timeZone = TimeZone(abbreviation: "UTC")
+        return formatter.string(from: date)
+    }
+
+    private func dateStamp(_ date: Date) -> String {
+        let formatter = DateFormatter()
+        formatter.dateFormat = "yyyyMMdd"
+        formatter.timeZone = TimeZone(abbreviation: "UTC")
+        return formatter.string(from: date)
+    }
+
+    private static let rfc3986: CharacterSet = {
+        var set = CharacterSet.alphanumerics
+        set.insert(charactersIn: "-._~")
+        return set
+    }()
+}
+
+private extension String {
+    func prependSlashIfNeeded() -> String {
+        if hasPrefix("/") {
+            return self
+        }
+        return "/" + self
+    }
+}
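Note on the `S3Uploader` added above: its `awsTimestamp` and `dateStamp` helpers create `DateFormatter`s without pinning `locale` or `calendar`. The `x-amz-date` value may therefore follow the user's calendar settings (for example, a non-Gregorian year), and the request signature would then fail to verify. The sketch below is a hypothetical standalone version of the SigV4 helpers. The names `hexString`, `hmacSHA256`, `sigV4SigningKey` and `sigV4DateFormatter` are illustrative and not part of EXO, and the sketch assumes CryptoKit on an Apple platform. It fixes the formatter to `en_US_POSIX`, the Gregorian calendar and UTC, and derives the signing key with the same HMAC-SHA256 chain that the patch's `deriveKey` uses.

```swift
import CryptoKit
import Foundation

/// Lowercase hex encoding of raw bytes (a derived key or an HMAC digest).
func hexString<S: Sequence>(_ bytes: S) -> String where S.Element == UInt8 {
    bytes.map { String(format: "%02x", $0) }.joined()
}

/// HMAC-SHA256 of the UTF-8 bytes of `message` under `key`, as raw bytes.
func hmacSHA256(key: Data, message: String) -> Data {
    let mac = HMAC<SHA256>.authenticationCode(
        for: Data(message.utf8), using: SymmetricKey(data: key))
    return Data(mac)
}

/// SigV4 signing key: an HMAC chain over date stamp, region, service and
/// the literal "aws4_request", seeded with "AWS4" + secret access key.
func sigV4SigningKey(secret: String, dateStamp: String, region: String, service: String) -> Data {
    let kDate = hmacSHA256(key: Data(("AWS4" + secret).utf8), message: dateStamp)
    let kRegion = hmacSHA256(key: kDate, message: region)
    let kService = hmacSHA256(key: kRegion, message: service)
    return hmacSHA256(key: kService, message: "aws4_request")
}

/// Fixed-format formatter for `x-amz-date` ("yyyyMMdd'T'HHmmss'Z'") and the
/// date stamp ("yyyyMMdd"), independent of the user's locale and calendar.
func sigV4DateFormatter(_ format: String) -> DateFormatter {
    let formatter = DateFormatter()
    formatter.locale = Locale(identifier: "en_US_POSIX")
    formatter.calendar = Calendar(identifier: .gregorian)
    formatter.timeZone = TimeZone(secondsFromGMT: 0)
    formatter.dateFormat = format
    return formatter
}
```

As a check, `hexString(sigV4SigningKey(secret: "wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY", dateStamp: "20120215", region: "us-east-1", service: "iam"))` should reproduce the signing key published in the AWS SigV4 documentation for that example secret.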
--- a/app/EXO/EXO/Services/ClusterStateService.swift
+++ b/app/EXO/EXO/Services/ClusterStateService.swift
@@ -57,9 +57,7 @@ final class ClusterStateService: ObservableObject {
            var request = URLRequest(url: url)
            request.cachePolicy = .reloadIgnoringLocalCacheData
            let (data, response) = try await session.data(for: request)
-            guard let httpResponse = response as? HTTPURLResponse,
-                (200..<300).contains(httpResponse.statusCode)
-            else {
+            guard let httpResponse = response as? HTTPURLResponse, (200..<300).contains(httpResponse.statusCode) else {
                return
            }
            if let nodeId = try? decoder.decode(String.self, from: data) {
@@ -115,9 +113,7 @@ final class ClusterStateService: ObservableObject {
        }
    }

-    func launchInstance(modelId: String, sharding: String, instanceMeta: String, minNodes: Int)
-        async
-    {
+    func launchInstance(modelId: String, sharding: String, instanceMeta: String, minNodes: Int) async {
        do {
            var request = URLRequest(url: baseURL.appendingPathComponent("instance"))
            request.httpMethod = "POST"
@@ -126,7 +122,7 @@ final class ClusterStateService: ObservableObject {
                "model_id": modelId,
                "sharding": sharding,
                "instance_meta": instanceMeta,
-                "min_nodes": minNodes,
+                "min_nodes": minNodes
            ]
            request.httpBody = try JSONSerialization.data(withJSONObject: payload, options: [])
            let (_, response) = try await session.data(for: request)
@@ -147,9 +143,7 @@ final class ClusterStateService: ObservableObject {
        do {
            let url = baseURL.appendingPathComponent("models")
            let (data, response) = try await session.data(from: url)
-            guard let httpResponse = response as? HTTPURLResponse,
-                (200..<300).contains(httpResponse.statusCode)
-            else {
+            guard let httpResponse = response as? HTTPURLResponse, (200..<300).contains(httpResponse.statusCode) else {
                throw URLError(.badServerResponse)
            }
            let list = try decoder.decode(ModelListResponse.self, from: data)
--- a/app/EXO/EXO/Services/LocalNetworkChecker.swift
+++ b/app/EXO/EXO/Services/LocalNetworkChecker.swift
@@ -1,149 +0,0 @@
-import Foundation
-import Network
-import os.log
-
-/// Checks if the app's local network permission is actually functional.
-///
-/// macOS local network permission can appear enabled in System Preferences but not
-/// actually work after a restart. This service uses NWConnection to mDNS multicast
-/// to verify actual connectivity.
-@MainActor
-final class LocalNetworkChecker: ObservableObject {
-    enum Status: Equatable {
-        case unknown
-        case checking
-        case working
-        case notWorking(reason: String)
-
-        var isHealthy: Bool {
-            if case .working = self { return true }
-            return false
-        }
-
-        var displayText: String {
-            switch self {
-            case .unknown:
-                return "Unknown"
-            case .checking:
-                return "Checking..."
-            case .working:
-                return "Working"
-            case .notWorking(let reason):
-                return reason
-            }
-        }
-    }
-
-    private static let logger = Logger(subsystem: "io.exo.EXO", category: "LocalNetworkChecker")
-    private static let hasCompletedInitialCheckKey = "LocalNetworkChecker.hasCompletedInitialCheck"
-
-    @Published private(set) var status: Status = .unknown
-
-    private var connection: NWConnection?
-    private var checkTask: Task<Void, Never>?
-
-    /// Whether we've completed at least one check (stored in UserDefaults)
-    private var hasCompletedInitialCheck: Bool {
-        get { UserDefaults.standard.bool(forKey: Self.hasCompletedInitialCheckKey) }
-        set { UserDefaults.standard.set(newValue, forKey: Self.hasCompletedInitialCheckKey) }
-    }
-
-    /// Checks if local network access is working.
-    func check() {
-        checkTask?.cancel()
-        status = .checking
-
-        // Use longer timeout on first launch to allow time for permission prompt
-        let isFirstCheck = !hasCompletedInitialCheck
-        let timeout: UInt64 = isFirstCheck ? 30_000_000_000 : 3_000_000_000
-
-        checkTask = Task { [weak self] in
-            guard let self else { return }
-
-            Self.logger.info("Checking local network connectivity (first check: \(isFirstCheck))")
-            let result = await self.checkConnectivity(timeout: timeout)
-            self.status = result
-            self.hasCompletedInitialCheck = true
-
-            Self.logger.info("Local network check complete: \(result.displayText)")
-        }
-    }
-
-    /// Checks connectivity using NWConnection to mDNS multicast.
-    /// The connection attempt triggers the permission prompt if not yet shown.
-    private func checkConnectivity(timeout: UInt64) async -> Status {
-        connection?.cancel()
-        connection = nil
-
-        // mDNS multicast address - same as libp2p uses for peer discovery
-        let host = NWEndpoint.Host("224.0.0.251")
-        let port = NWEndpoint.Port(integerLiteral: 5353)
-
-        let params = NWParameters.udp
-        params.allowLocalEndpointReuse = true
-
-        let conn = NWConnection(host: host, port: port, using: params)
-        connection = conn
-
-        return await withCheckedContinuation { continuation in
-            var hasResumed = false
-            let lock = NSLock()
-
-            let resumeOnce: (Status) -> Void = { status in
-                lock.lock()
-                defer { lock.unlock() }
-                guard !hasResumed else { return }
-                hasResumed = true
-                continuation.resume(returning: status)
-            }
-
-            conn.stateUpdateHandler = { state in
-                switch state {
-                case .ready:
-                    resumeOnce(.working)
-                case .waiting(let error):
-                    let errorStr = "\(error)"
-                    if errorStr.contains("54") || errorStr.contains("ECONNRESET") {
-                        resumeOnce(.notWorking(reason: "Connection blocked"))
-                    }
-                // Otherwise keep waiting - might be showing permission prompt
-                case .failed(let error):
-                    let errorStr = "\(error)"
-                    if errorStr.contains("65") || errorStr.contains("EHOSTUNREACH")
-                        || errorStr.contains("permission") || errorStr.contains("denied")
-                    {
-                        resumeOnce(.notWorking(reason: "Permission denied"))
-                    } else {
-                        resumeOnce(.notWorking(reason: "Failed: \(error.localizedDescription)"))
-                    }
-                case .cancelled, .setup, .preparing:
-                    break
-                @unknown default:
-                    break
-                }
-            }
-
-            conn.start(queue: .main)
-
-            Task {
-                try? await Task.sleep(nanoseconds: timeout)
-                let state = conn.state
-                switch state {
-                case .ready:
-                    resumeOnce(.working)
-                case .waiting, .preparing, .setup:
-                    resumeOnce(.notWorking(reason: "Timeout (may be blocked)"))
-                default:
-                    resumeOnce(.notWorking(reason: "Timeout"))
-                }
-            }
-        }
-    }
-
-    func stop() {
-        checkTask?.cancel()
-        checkTask = nil
-        connection?.cancel()
-        connection = nil
-    }
-}
--- a/app/EXO/EXO/Services/NetworkSetupHelper.swift
+++ b/app/EXO/EXO/Services/NetworkSetupHelper.swift
@@ -5,66 +5,64 @@ import os.log
 enum NetworkSetupHelper {
    private static let logger = Logger(subsystem: "io.exo.EXO", category: "NetworkSetup")
    private static let daemonLabel = "io.exo.networksetup"
-    private static let scriptDestination =
-        "/Library/Application Support/EXO/disable_bridge_enable_dhcp.sh"
+    private static let scriptDestination = "/Library/Application Support/EXO/disable_bridge_enable_dhcp.sh"
    private static let plistDestination = "/Library/LaunchDaemons/io.exo.networksetup.plist"
    private static let requiredStartInterval: Int = 1791

    private static let setupScript = """
-        #!/usr/bin/env bash
+#!/usr/bin/env bash

-        set -euo pipefail
+set -euo pipefail

-        PREFS="/Library/Preferences/SystemConfiguration/preferences.plist"
+PREFS="/Library/Preferences/SystemConfiguration/preferences.plist"

-        # Remove bridge0 interface
-        ifconfig bridge0 &>/dev/null && {
-          ifconfig bridge0 | grep -q 'member' && {
-            ifconfig bridge0 | awk '/member/ {print $2}' | xargs -n1 ifconfig bridge0 deletem 2>/dev/null || true
-          }
-          ifconfig bridge0 destroy 2>/dev/null || true
-        }
+# Remove bridge0 interface
+ifconfig bridge0 &>/dev/null && {
+  ifconfig bridge0 | grep -q 'member' && {
+    ifconfig bridge0 | awk '/member/ {print $2}' | xargs -n1 ifconfig bridge0 deletem 2>/dev/null || true
+  }
+  ifconfig bridge0 destroy 2>/dev/null || true
+}

-        # Remove Thunderbolt Bridge from VirtualNetworkInterfaces in preferences.plist
-        /usr/libexec/PlistBuddy -c "Delete :VirtualNetworkInterfaces:Bridge:bridge0" "$PREFS" 2>/dev/null || true
+# Remove Thunderbolt Bridge from VirtualNetworkInterfaces in preferences.plist
+/usr/libexec/PlistBuddy -c "Delete :VirtualNetworkInterfaces:Bridge:bridge0" "$PREFS" 2>/dev/null || true

-        networksetup -listlocations | grep -q exo || {
-          networksetup -createlocation exo
-        }
+networksetup -listlocations | grep -q exo || {
+  networksetup -createlocation exo
+}

-        networksetup -switchtolocation exo
-        networksetup -listallhardwareports \\
-          | awk -F': ' '/Hardware Port: / {print $2}' \\
-          | while IFS=":" read -r name; do
-              case "$name" in
-                "Ethernet Adapter"*)
-                        ;;
-                "Thunderbolt Bridge")
-                        ;;
-                "Thunderbolt "*)
-                  networksetup -listallnetworkservices \\
-                    | grep -q "EXO $name" \\
-                      || networksetup -createnetworkservice "EXO $name" "$name" 2>/dev/null \\
-                      || continue
-                  networksetup -setdhcp "EXO $name"
-                        ;;
-                *)
-                  networksetup -listallnetworkservices \\
-                    | grep -q "$name" \\
-                      || networksetup -createnetworkservice "$name" "$name" 2>/dev/null \\
-                      || continue
-                        ;;
-              esac
-            done
+networksetup -switchtolocation exo
+networksetup -listallhardwareports \\
+  | awk -F': ' '/Hardware Port: / {print $2}' \\
+  | while IFS=":" read -r name; do
+      case "$name" in
+        "Ethernet Adapter"*)
+                ;;
+        "Thunderbolt Bridge")
+                ;;
+        "Thunderbolt "*)
+          networksetup -listallnetworkservices \\
+            | grep -q "EXO $name" \\
+              || networksetup -createnetworkservice "EXO $name" "$name" 2>/dev/null \\
+              || continue
+          networksetup -setdhcp "EXO $name"
+                ;;
+        *)
+          networksetup -listallnetworkservices \\
+            | grep -q "$name" \\
+              || networksetup -createnetworkservice "$name" "$name" 2>/dev/null \\
+              || continue
+                ;;
+      esac
+    done

-        networksetup -listnetworkservices | grep -q "Thunderbolt Bridge" && {
-          networksetup -setnetworkserviceenabled "Thunderbolt Bridge" off
-        } || true
-        """
+networksetup -listnetworkservices | grep -q "Thunderbolt Bridge" && {
+  networksetup -setnetworkserviceenabled "Thunderbolt Bridge" off
+} || true
+"""

    static func ensureLaunchDaemonInstalled() {
-        // Use .utility priority to match NSAppleScript's internal QoS and avoid priority inversion
-        Task.detached(priority: .utility) {
+        Task.detached {
            do {
                if daemonAlreadyInstalled() {
                    return
@@ -72,70 +70,11 @@ enum NetworkSetupHelper {
                try await installLaunchDaemon()
                logger.info("Network setup launch daemon installed and started")
            } catch {
-                logger.error(
-                    "Network setup launch daemon failed: \(error.localizedDescription, privacy: .public)"
-                )
+                logger.error("Network setup launch daemon failed: \(error.localizedDescription, privacy: .public)")
            }
        }
    }

-    /// Removes all EXO network setup components from the system.
-    /// This includes the LaunchDaemon, scripts, logs, and network location.
-    /// Requires admin privileges.
-    static func uninstall() throws {
-        let uninstallScript = makeUninstallScript()
-        try runShellAsAdmin(uninstallScript)
-        logger.info("EXO network setup components removed successfully")
-    }
-
-    /// Checks if there are any EXO network components installed that need cleanup
-    static func hasInstalledComponents() -> Bool {
-        let manager = FileManager.default
-        let scriptExists = manager.fileExists(atPath: scriptDestination)
-        let plistExists = manager.fileExists(atPath: plistDestination)
-        return scriptExists || plistExists
-    }
-
-    private static func makeUninstallScript() -> String {
-        """
-        set -euo pipefail
-
-        LABEL="\(daemonLabel)"
-        SCRIPT_DEST="\(scriptDestination)"
-        PLIST_DEST="\(plistDestination)"
-        LOG_OUT="/var/log/\(daemonLabel).log"
-        LOG_ERR="/var/log/\(daemonLabel).err.log"
-
-        # Unload the LaunchDaemon if running
-        launchctl bootout system/"$LABEL" 2>/dev/null || true
-
-        # Remove LaunchDaemon plist
-        rm -f "$PLIST_DEST"
-
-        # Remove the script and parent directory if empty
-        rm -f "$SCRIPT_DEST"
-        rmdir "$(dirname "$SCRIPT_DEST")" 2>/dev/null || true
-
-        # Remove log files
-        rm -f "$LOG_OUT" "$LOG_ERR"
-
-        # Switch back to Automatic network location
-        networksetup -switchtolocation Automatic 2>/dev/null || true
-
-        # Delete the exo network location if it exists
-        networksetup -listlocations | grep -q '^exo$' && {
-          networksetup -deletelocation exo 2>/dev/null || true
-        } || true
-
-        # Re-enable Thunderbolt Bridge if it exists
-        networksetup -listnetworkservices | grep -q "Thunderbolt Bridge" && {
-          networksetup -setnetworkserviceenabled "Thunderbolt Bridge" on 2>/dev/null || true
-        } || true
-
-        echo "EXO network components removed successfully"
-        """
-    }
-
    private static func daemonAlreadyInstalled() -> Bool {
        let manager = FileManager.default
        let scriptExists = manager.fileExists(atPath: scriptDestination)
@@ -143,8 +82,7 @@ enum NetworkSetupHelper {
        guard scriptExists, plistExists else { return false }
        guard
            let data = try? Data(contentsOf: URL(fileURLWithPath: plistDestination)),
-            let plist = try? PropertyListSerialization.propertyList(
-                from: data, options: [], format: nil) as? [String: Any]
+            let plist = try? PropertyListSerialization.propertyList(from: data, options: [], format: nil) as? [String: Any]
        else {
            return false
        }
@@ -154,9 +92,7 @@ enum NetworkSetupHelper {
        else {
            return false
        }
-        if let programArgs = plist["ProgramArguments"] as? [String],
-            programArgs.contains(scriptDestination) == false
-        {
+        if let programArgs = plist["ProgramArguments"] as? [String], programArgs.contains(scriptDestination) == false {
            return false
        }
        return true
@@ -169,59 +105,58 @@ enum NetworkSetupHelper {

    private static func makeInstallerScript() -> String {
        """
-        set -euo pipefail
+set -euo pipefail

-        LABEL="\(daemonLabel)"
-        SCRIPT_DEST="\(scriptDestination)"
-        PLIST_DEST="\(plistDestination)"
+LABEL="\(daemonLabel)"
+SCRIPT_DEST="\(scriptDestination)"
+PLIST_DEST="\(plistDestination)"

-        mkdir -p "$(dirname "$SCRIPT_DEST")"
+mkdir -p "$(dirname "$SCRIPT_DEST")"

-        cat > "$SCRIPT_DEST" <<'EOF_SCRIPT'
-        \(setupScript)
-        EOF_SCRIPT
-        chmod 755 "$SCRIPT_DEST"
+cat > "$SCRIPT_DEST" <<'EOF_SCRIPT'
+\(setupScript)
+EOF_SCRIPT
+chmod 755 "$SCRIPT_DEST"

-        cat > "$PLIST_DEST" <<'EOF_PLIST'
-        <?xml version="1.0" encoding="UTF-8"?>
-        <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
-        <plist version="1.0">
-        <dict>
-          <key>Label</key>
-          <string>\(daemonLabel)</string>
-          <key>ProgramArguments</key>
-          <array>
-            <string>/bin/bash</string>
-            <string>\(scriptDestination)</string>
-          </array>
-          <key>StartInterval</key>
-          <integer>\(requiredStartInterval)</integer>
-          <key>RunAtLoad</key>
-          <true/>
-          <key>StandardOutPath</key>
-          <string>/var/log/\(daemonLabel).log</string>
-          <key>StandardErrorPath</key>
-          <string>/var/log/\(daemonLabel).err.log</string>
-        </dict>
-        </plist>
-        EOF_PLIST
+cat > "$PLIST_DEST" <<'EOF_PLIST'
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+  <key>Label</key>
+  <string>\(daemonLabel)</string>
+  <key>ProgramArguments</key>
+  <array>
+    <string>/bin/bash</string>
+    <string>\(scriptDestination)</string>
+  </array>
+  <key>StartInterval</key>
+  <integer>\(requiredStartInterval)</integer>
+  <key>RunAtLoad</key>
+  <true/>
+  <key>StandardOutPath</key>
+  <string>/var/log/\(daemonLabel).log</string>
+  <key>StandardErrorPath</key>
+  <string>/var/log/\(daemonLabel).err.log</string>
+</dict>
+</plist>
+EOF_PLIST

-        launchctl bootout system/"$LABEL" >/dev/null 2>&1 || true
-        launchctl bootstrap system "$PLIST_DEST"
-        launchctl enable system/"$LABEL"
-        launchctl kickstart -k system/"$LABEL"
-        """
+launchctl bootout system/"$LABEL" >/dev/null 2>&1 || true
+launchctl bootstrap system "$PLIST_DEST"
+launchctl enable system/"$LABEL"
+launchctl kickstart -k system/"$LABEL"
+"""
    }

    private static func runShellAsAdmin(_ script: String) throws {
-        let escapedScript =
-            script
+        let escapedScript = script
            .replacingOccurrences(of: "\\", with: "\\\\")
            .replacingOccurrences(of: "\"", with: "\\\"")

        let appleScriptSource = """
-            do shell script "\(escapedScript)" with administrator privileges
-            """
+do shell script "\(escapedScript)" with administrator privileges
+"""

        guard let appleScript = NSAppleScript(source: appleScriptSource) else {
            throw NetworkSetupError.scriptCreationFailed
--- a/app/EXO/EXO/Services/NetworkStatusService.swift
+++ b/app/EXO/EXO/Services/NetworkStatusService.swift
@@ -35,34 +35,14 @@ struct NetworkStatus: Equatable {
    let thunderboltBridgeState: ThunderboltState?
    let bridgeInactive: Bool?
    let interfaceStatuses: [InterfaceIpStatus]
-    let rdmaStatus: RDMAStatus

    static let empty = NetworkStatus(
        thunderboltBridgeState: nil,
        bridgeInactive: nil,
-        interfaceStatuses: [],
-        rdmaStatus: .empty
+        interfaceStatuses: []
    )
 }

-struct RDMAStatus: Equatable {
-    let rdmaCtlEnabled: Bool?
-    let devices: [String]
-    let activePorts: [RDMAPort]
-
-    var isAvailable: Bool {
-        rdmaCtlEnabled == true || !devices.isEmpty
-    }
-
-    static let empty = RDMAStatus(rdmaCtlEnabled: nil, devices: [], activePorts: [])
-}
-
-struct RDMAPort: Equatable {
-    let device: String
-    let port: String
-    let state: String
-}
-
 struct InterfaceIpStatus: Equatable {
    let interfaceName: String
    let ipAddress: String?
@@ -79,79 +59,10 @@ private struct NetworkStatusFetcher {
        NetworkStatus(
            thunderboltBridgeState: readThunderboltBridgeState(),
            bridgeInactive: readBridgeInactive(),
-            interfaceStatuses: readInterfaceStatuses(),
-            rdmaStatus: readRDMAStatus()
+            interfaceStatuses: readInterfaceStatuses()
        )
    }

-    private func readRDMAStatus() -> RDMAStatus {
-        let rdmaCtlEnabled = readRDMACtlEnabled()
-        let devices = readRDMADevices()
-        let activePorts = readRDMAActivePorts()
-        return RDMAStatus(
-            rdmaCtlEnabled: rdmaCtlEnabled, devices: devices, activePorts: activePorts)
-    }
-
-    private func readRDMACtlEnabled() -> Bool? {
-        let result = runCommand(["rdma_ctl", "status"])
-        guard result.exitCode == 0 else { return nil }
-        let output = result.output.lowercased().trimmingCharacters(in: .whitespacesAndNewlines)
-        if output.contains("enabled") {
-            return true
-        }
-        if output.contains("disabled") {
-            return false
-        }
-        return nil
-    }
-
-    private func readRDMADevices() -> [String] {
-        let result = runCommand(["ibv_devices"])
-        guard result.exitCode == 0 else { return [] }
-        var devices: [String] = []
-        for line in result.output.split(separator: "\n") {
-            let trimmed = line.trimmingCharacters(in: .whitespaces)
-            if trimmed.hasPrefix("---") || trimmed.lowercased().hasPrefix("device")
-                || trimmed.isEmpty
-            {
-                continue
-            }
-            let parts = trimmed.split(separator: " ", maxSplits: 1)
-            if let deviceName = parts.first {
-                devices.append(String(deviceName))
-            }
-        }
-        return devices
-    }
-
-    private func readRDMAActivePorts() -> [RDMAPort] {
-        let result = runCommand(["ibv_devinfo"])
-        guard result.exitCode == 0 else { return [] }
-        var ports: [RDMAPort] = []
-        var currentDevice: String?
-        var currentPort: String?
-
-        for line in result.output.split(separator: "\n") {
-            let trimmed = line.trimmingCharacters(in: .whitespaces)
-            if trimmed.hasPrefix("hca_id:") {
-                currentDevice = trimmed.replacingOccurrences(of: "hca_id:", with: "")
-                    .trimmingCharacters(in: .whitespaces)
-            } else if trimmed.hasPrefix("port:") {
-                currentPort = trimmed.replacingOccurrences(of: "port:", with: "")
-                    .trimmingCharacters(in: .whitespaces)
-            } else if trimmed.hasPrefix("state:") {
-                let state = trimmed.replacingOccurrences(of: "state:", with: "").trimmingCharacters(
-                    in: .whitespaces)
-                if let device = currentDevice, let port = currentPort {
-                    if state.lowercased().contains("active") {
-                        ports.append(RDMAPort(device: device, port: port, state: state))
-                    }
-                }
-            }
-        }
-        return ports
-    }
-
    private func readThunderboltBridgeState() -> ThunderboltState? {
        let result = runCommand(["networksetup", "-getnetworkserviceenabled", "Thunderbolt Bridge"])
        guard result.exitCode == 0 else {
@@ -174,11 +85,10 @@ private struct NetworkStatusFetcher {
    private func readBridgeInactive() -> Bool? {
        let result = runCommand(["ifconfig", "bridge0"])
        guard result.exitCode == 0 else { return nil }
-        guard
-            let statusLine = result.output
-                .components(separatedBy: .newlines)
-                .first(where: { $0.contains("status:") })?
-                .lowercased()
+        guard let statusLine = result.output
+            .components(separatedBy: .newlines)
+            .first(where: { $0.contains("status:") })?
+            .lowercased()
        else {
            return nil
        }
@@ -261,3 +171,4 @@ private struct NetworkStatusFetcher {
        )
    }
 }
+
--- a/app/EXO/EXO/ViewModels/InstanceViewModel.swift
+++ b/app/EXO/EXO/ViewModels/InstanceViewModel.swift
@@ -57,7 +57,7 @@ struct InstanceViewModel: Identifiable, Equatable {
        case waiting
        case failed
        case idle
-        case preparing
+        case unknown

        var label: String {
            switch self {
@@ -68,7 +68,7 @@ struct InstanceViewModel: Identifiable, Equatable {
            case .waiting: return "Waiting"
            case .failed: return "Failed"
            case .idle: return "Idle"
-            case .preparing: return "Preparing"
+            case .unknown: return "Unknown"
            }
        }
    }
@@ -107,13 +107,10 @@ extension ClusterState {
            let nodeToRunner = instance.shardAssignments.nodeToRunner
            let nodeIds = Array(nodeToRunner.keys)
            let runnerIds = Array(nodeToRunner.values)
-            let nodeNames = nodeIds.compactMap {
-                nodeProfiles[$0]?.friendlyName ?? nodeProfiles[$0]?.modelId ?? $0
-            }
+            let nodeNames = nodeIds.compactMap { nodeProfiles[$0]?.friendlyName ?? nodeProfiles[$0]?.modelId ?? $0 }
            let statuses = runnerIds.compactMap { runners[$0]?.status.lowercased() }
            let downloadProgress = aggregateDownloadProgress(for: nodeIds)
-            let state = InstanceViewModel.State(
-                statuses: statuses, hasActiveDownload: downloadProgress != nil)
+            let state = InstanceViewModel.State(statuses: statuses, hasActiveDownload: downloadProgress != nil)
            let chatTasks = (chatTasksByInstance[entry.key] ?? [])
                .sorted(by: { $0.sortPriority < $1.sortPriority })
                .map { InstanceTaskViewModel(task: $0) }
@@ -168,8 +165,8 @@ extension ClusterState {
    }
 }

-extension InstanceViewModel.State {
-    fileprivate init(statuses: [String], hasActiveDownload: Bool = false) {
+private extension InstanceViewModel.State {
+    init(statuses: [String], hasActiveDownload: Bool = false) {
        if statuses.contains(where: { $0.contains("failed") }) {
            self = .failed
        } else if hasActiveDownload || statuses.contains(where: { $0.contains("downloading") }) {
@@ -185,7 +182,7 @@ extension InstanceViewModel.State {
        } else if statuses.isEmpty {
            self = .idle
        } else {
-            self = .preparing
+            self = .unknown
        }
    }
 }
@@ -246,3 +243,4 @@ extension InstanceTaskViewModel {
        self.parameters = task.parameters
    }
 }
+
--- a/app/EXO/EXO/ViewModels/NodeViewModel.swift
+++ b/app/EXO/EXO/ViewModels/NodeViewModel.swift
@@ -87,9 +87,7 @@ struct TopologyViewModel {
 extension ClusterState {
    func topologyViewModel(localNodeId: String?) -> TopologyViewModel? {
        let topologyNodeIds = Set(topology?.nodes.map(\.nodeId) ?? [])
-        let allNodes = nodeViewModels().filter {
-            topologyNodeIds.isEmpty || topologyNodeIds.contains($0.id)
-        }
+        let allNodes = nodeViewModels().filter { topologyNodeIds.isEmpty || topologyNodeIds.contains($0.id) }
        guard !allNodes.isEmpty else { return nil }

        let nodesById = Dictionary(uniqueKeysWithValues: allNodes.map { ($0.id, $0) })
@@ -108,24 +106,18 @@ extension ClusterState {
        }

        // Rotate so the local node (from /node_id API) is first
-        if let localId = localNodeId,
-            let index = orderedNodes.firstIndex(where: { $0.id == localId })
-        {
+        if let localId = localNodeId, let index = orderedNodes.firstIndex(where: { $0.id == localId }) {
            orderedNodes = Array(orderedNodes[index...]) + Array(orderedNodes[..<index])
        }

        let nodeIds = Set(orderedNodes.map(\.id))
-        let edgesArray: [TopologyEdgeViewModel] =
-            topology?.connections?.compactMap { connection in
-                guard nodeIds.contains(connection.localNodeId),
-                    nodeIds.contains(connection.sendBackNodeId)
-                else { return nil }
-                return TopologyEdgeViewModel(
-                    sourceId: connection.localNodeId, targetId: connection.sendBackNodeId)
-            } ?? []
+        let edgesArray: [TopologyEdgeViewModel] = topology?.connections?.compactMap { connection in
+            guard nodeIds.contains(connection.localNodeId), nodeIds.contains(connection.sendBackNodeId) else { return nil }
+            return TopologyEdgeViewModel(sourceId: connection.localNodeId, targetId: connection.sendBackNodeId)
+        } ?? []
        let edges = Set(edgesArray)

-        return TopologyViewModel(
-            nodes: orderedNodes, edges: Array(edges), currentNodeId: localNodeId)
+        return TopologyViewModel(nodes: orderedNodes, edges: Array(edges), currentNodeId: localNodeId)
    }
 }
+
--- a/app/EXO/EXO/Views/InstanceRowView.swift
+++ b/app/EXO/EXO/Views/InstanceRowView.swift
@@ -20,8 +20,8 @@ struct InstanceRowView: View {
                if let progress = instance.downloadProgress {
                    downloadStatusView(progress: progress)
                } else {
-                    statusChip(label: instance.state.label.uppercased(), color: statusColor)
-                }
+                statusChip(label: instance.state.label.uppercased(), color: statusColor)
+            }
            }
            if let progress = instance.downloadProgress {
                GeometryReader { geometry in
@@ -83,7 +83,7 @@ struct InstanceRowView: View {
        case .ready: return .teal
        case .waiting, .idle: return .gray
        case .failed: return .red
-        case .preparing: return .secondary
+        case .unknown: return .secondary
        }
    }

@@ -97,8 +97,7 @@ struct InstanceRowView: View {
                        .font(.caption)
                        .fontWeight(.semibold)
                    if let subtitle = task.subtitle,
-                        subtitle.caseInsensitiveCompare(parentModelName) != .orderedSame
-                    {
+                       subtitle.caseInsensitiveCompare(parentModelName) != .orderedSame {
                        Text(subtitle)
                            .font(.caption2)
                            .foregroundColor(.secondary)
@@ -235,12 +234,9 @@ struct InstanceRowView: View {
        Button {
            isExpanded.wrappedValue.toggle()
        } label: {
-            Label(
-                isExpanded.wrappedValue ? "Hide" : "Show",
-                systemImage: isExpanded.wrappedValue ? "chevron.up" : "chevron.down"
-            )
-            .labelStyle(.titleAndIcon)
-            .contentTransition(.symbolEffect(.replace))
+            Label(isExpanded.wrappedValue ? "Hide" : "Show", systemImage: isExpanded.wrappedValue ? "chevron.up" : "chevron.down")
+                .labelStyle(.titleAndIcon)
+                .contentTransition(.symbolEffect(.replace))
        }
        .buttonStyle(.plain)
        .font(.caption2)
@@ -315,9 +311,7 @@ struct InstanceRowView: View {
        }

        @ViewBuilder
-        private func detailRow(
-            icon: String? = nil, title: String, value: String, tint: Color = .secondary
-        ) -> some View {
+        private func detailRow(icon: String? = nil, title: String, value: String, tint: Color = .secondary) -> some View {
            HStack(alignment: .firstTextBaseline, spacing: 6) {
                if let icon {
                    Image(systemName: icon)
@@ -335,3 +329,4 @@ struct InstanceRowView: View {
        }
    }
 }
+
--- a/app/EXO/EXO/Views/NodeDetailView.swift
+++ b/app/EXO/EXO/Views/NodeDetailView.swift
@@ -32,3 +32,4 @@ struct NodeDetailView: View {
        }
    }
 }
+
--- a/app/EXO/EXO/Views/NodeRowView.swift
+++ b/app/EXO/EXO/Views/NodeRowView.swift
@@ -28,3 +28,4 @@ struct NodeRowView: View {
        .padding(.vertical, 4)
    }
 }
+
--- a/app/EXO/EXO/Views/TopologyMiniView.swift
+++ b/app/EXO/EXO/Views/TopologyMiniView.swift
@@ -76,33 +76,30 @@ struct TopologyMiniView: View {

    private func connectionLines(in size: CGSize) -> some View {
        let positions = positionedNodes(in: size)
-        let positionById = Dictionary(
-            uniqueKeysWithValues: positions.map { ($0.node.id, $0.point) })
+        let positionById = Dictionary(uniqueKeysWithValues: positions.map { ($0.node.id, $0.point) })
        return Canvas { context, _ in
            guard !topology.edges.isEmpty else { return }
            let nodeRadius: CGFloat = 32
            let arrowLength: CGFloat = 10
            let arrowSpread: CGFloat = .pi / 7
            for edge in topology.edges {
-                guard let start = positionById[edge.sourceId], let end = positionById[edge.targetId]
-                else { continue }
+                guard let start = positionById[edge.sourceId], let end = positionById[edge.targetId] else { continue }
                let dx = end.x - start.x
                let dy = end.y - start.y
                let distance = max(CGFloat(hypot(dx, dy)), 1)
                let ux = dx / distance
                let uy = dy / distance
-                let adjustedStart = CGPoint(
-                    x: start.x + ux * nodeRadius, y: start.y + uy * nodeRadius)
+                let adjustedStart = CGPoint(x: start.x + ux * nodeRadius, y: start.y + uy * nodeRadius)
                let adjustedEnd = CGPoint(x: end.x - ux * nodeRadius, y: end.y - uy * nodeRadius)

                var linePath = Path()
                linePath.move(to: adjustedStart)
                linePath.addLine(to: adjustedEnd)
-                context.stroke(
+            context.stroke(
                    linePath,
                    with: .color(.secondary.opacity(0.3)),
-                    style: StrokeStyle(lineWidth: 1, dash: [4, 4])
-                )
+                style: StrokeStyle(lineWidth: 1, dash: [4, 4])
+            )

                let angle = atan2(uy, ux)
                let tip = adjustedEnd
@@ -171,3 +168,5 @@ private struct NodeGlyphView: View {
        .frame(width: 95)
    }
 }
+
+
--- a/app/EXO/EXOTests/EXOTests.swift
+++ b/app/EXO/EXOTests/EXOTests.swift
@@ -6,7 +6,6 @@
 //

 import Testing
-
 @testable import EXO

 struct EXOTests {
--- a/app/EXO/uninstall-exo.sh
+++ b/app/EXO/uninstall-exo.sh
@@ -1,154 +0,0 @@
-#!/usr/bin/env bash
-#
-# EXO Uninstaller Script
-#
-# This script removes all EXO system components that persist after deleting the app.
-# Run with: sudo ./uninstall-exo.sh
-#
-# Components removed:
-# - LaunchDaemon: /Library/LaunchDaemons/io.exo.networksetup.plist
-# - Network script: /Library/Application Support/EXO/
-# - Log files: /var/log/io.exo.networksetup.*
-# - Network location: "exo"
-# - Launch at login registration
-#
-
-set -euo pipefail
-
-LABEL="io.exo.networksetup"
-SCRIPT_DEST="/Library/Application Support/EXO/disable_bridge_enable_dhcp.sh"
-PLIST_DEST="/Library/LaunchDaemons/io.exo.networksetup.plist"
-LOG_OUT="/var/log/${LABEL}.log"
-LOG_ERR="/var/log/${LABEL}.err.log"
-APP_BUNDLE_ID="io.exo.EXO"
-
-# Colors for output
-RED='\033[0;31m'
-GREEN='\033[0;32m'
-YELLOW='\033[1;33m'
-NC='\033[0m' # No Color
-
-echo_info() {
-    echo -e "${GREEN}[INFO]${NC} $1"
-}
-
-echo_warn() {
-    echo -e "${YELLOW}[WARN]${NC} $1"
-}
-
-echo_error() {
-    echo -e "${RED}[ERROR]${NC} $1"
-}
-
-# Check if running as root
-if [[ $EUID -ne 0 ]]; then
-    echo_error "This script must be run as root (use sudo)"
-    exit 1
-fi
-
-echo ""
-echo "========================================"
-echo "        EXO Uninstaller"
-echo "========================================"
-echo ""
-
-# Unload the LaunchDaemon if running
-echo_info "Stopping network setup daemon..."
-if launchctl list | grep -q "$LABEL"; then
-    launchctl bootout system/"$LABEL" 2>/dev/null || true
-    echo_info "Daemon stopped"
-else
-    echo_warn "Daemon was not running"
-fi
-
-# Remove LaunchDaemon plist
-if [[ -f "$PLIST_DEST" ]]; then
-    rm -f "$PLIST_DEST"
-    echo_info "Removed LaunchDaemon plist"
-else
-    echo_warn "LaunchDaemon plist not found (already removed?)"
-fi
-
-# Remove the script and parent directory
-if [[ -f "$SCRIPT_DEST" ]]; then
-    rm -f "$SCRIPT_DEST"
-    echo_info "Removed network setup script"
-else
-    echo_warn "Network setup script not found (already removed?)"
-fi
-
-# Remove EXO directory if empty
-if [[ -d "/Library/Application Support/EXO" ]]; then
-    rmdir "/Library/Application Support/EXO" 2>/dev/null && \
-        echo_info "Removed EXO support directory" || \
-        echo_warn "EXO support directory not empty, leaving in place"
-fi
-
-# Remove log files
-if [[ -f "$LOG_OUT" ]] || [[ -f "$LOG_ERR" ]]; then
-    rm -f "$LOG_OUT" "$LOG_ERR"
-    echo_info "Removed log files"
-else
-    echo_warn "Log files not found (already removed?)"
-fi
-
-# Switch back to Automatic network location
-echo_info "Restoring network configuration..."
-if networksetup -listlocations | grep -q "^Automatic$"; then
-    networksetup -switchtolocation Automatic 2>/dev/null || true
-    echo_info "Switched to Automatic network location"
-else
-    echo_warn "Automatic network location not found"
-fi
-
-# Delete the exo network location if it exists
-if networksetup -listlocations | grep -q "^exo$"; then
-    networksetup -deletelocation exo 2>/dev/null || true
-    echo_info "Deleted 'exo' network location"
-else
-    echo_warn "'exo' network location not found (already removed?)"
-fi
-
-# Re-enable Thunderbolt Bridge if it exists
-if networksetup -listnetworkservices 2>/dev/null | grep -q "Thunderbolt Bridge"; then
-    networksetup -setnetworkserviceenabled "Thunderbolt Bridge" on 2>/dev/null || true
-    echo_info "Re-enabled Thunderbolt Bridge"
-fi
-
-# Note about launch at login registration
-# SMAppService-based login items cannot be removed from a shell script.
-# They can only be unregistered from within the app itself or manually via System Settings.
-echo_warn "Launch at login must be removed manually:"
-echo_warn "  System Settings → General → Login Items → Remove EXO"
-
-# Check if EXO.app exists in common locations
-APP_FOUND=false
-for app_path in "/Applications/EXO.app" "$HOME/Applications/EXO.app"; do
-    if [[ -d "$app_path" ]]; then
-        if [[ "$APP_FOUND" == false ]]; then
-            echo ""
-            APP_FOUND=true
-        fi
-        echo_warn "EXO.app found at: $app_path"
-        echo_warn "You may want to move it to Trash manually."
-    fi
-done
-
-echo ""
-echo "========================================"
-echo_info "EXO uninstall complete!"
-echo "========================================"
-echo ""
-echo "The following have been removed:"
-echo "  • Network setup LaunchDaemon"
-echo "  • Network configuration script"
-echo "  • Log files"
-echo "  • 'exo' network location"
-echo ""
-echo "Your network has been restored to use the 'Automatic' location."
-echo "Thunderbolt Bridge has been re-enabled (if present)."
-echo ""
-echo "Manual step required:"
-echo "  Remove EXO from Login Items in System Settings → General → Login Items"
-echo ""
-
--- a/bench/exo_bench.py
+++ b/bench/exo_bench.py
@@ -1,566 +0,0 @@
-#!/usr/bin/env python3
-# pyright: reportAny=false, reportUnknownMemberType=false, reportUnknownVariableType=false, reportUnknownArgumentType=false
-from __future__ import annotations
-
-import argparse
-import contextlib
-import http.client
-import json
-import os
-import time
-from collections.abc import Callable
-from statistics import mean
-from typing import Any
-from urllib.parse import urlencode
-
-from loguru import logger
-from transformers import AutoTokenizer
-
-from exo.shared.models.model_cards import MODEL_CARDS
-from exo.shared.types.memory import Memory
-
-
-class ExoHttpError(RuntimeError):
-    def __init__(self, status: int, reason: str, body_preview: str):
-        super().__init__(f"HTTP {status} {reason}: {body_preview}")
-        self.status = status
-
-
-class ExoClient:
-    def __init__(self, host: str, port: int, timeout_s: float = 600.0):
-        self.host = host
-        self.port = port
-        self.timeout_s = timeout_s
-
-    def request_json(
-        self,
-        method: str,
-        path: str,
-        params: dict[str, Any] | None = None,
-        body: dict[str, Any] | None = None,
-        headers: dict[str, str] | None = None,
-    ) -> Any:
-        if not path.startswith("/"):
-            path = "/" + path
-        if params:
-            path = path + "?" + urlencode(params)
-
-        conn = http.client.HTTPConnection(self.host, self.port, timeout=self.timeout_s)
-        try:
-            payload: bytes | None = None
-            hdrs: dict[str, str] = {"Accept": "application/json"}
-
-            if body is not None:
-                payload = json.dumps(body).encode("utf-8")
-                hdrs["Content-Type"] = "application/json"
-            if headers:
-                hdrs.update(headers)
-
-            conn.request(method.upper(), path, body=payload, headers=hdrs)
-            resp = conn.getresponse()
-            raw = resp.read()
-            text = raw.decode("utf-8", errors="replace") if raw else ""
-
-            if resp.status >= 400:
-                raise ExoHttpError(resp.status, resp.reason, text[:300])
-
-            if not text:
-                return None
-            return json.loads(text)
-        finally:
-            conn.close()
-
-    def post_bench_chat_completions(self, payload: dict[str, Any]) -> dict[str, Any]:
-        return self.request_json("POST", "/bench/chat/completions", body=payload)
-
-
-def unwrap_instance(instance: dict[str, Any]) -> dict[str, Any]:
-    if len(instance) != 1:
-        raise KeyError(f"Expected 1 key, got keys={list(instance.keys())}")
-
-    tag = next(iter(instance))
-    inner = instance[tag]
-    if not isinstance(inner, dict):
-        raise TypeError(f"payload for {tag} must be dict, got {type(inner)}")
-    return inner
-
-
-def instance_id_from_instance(instance: dict[str, Any]) -> str:
-    inner = unwrap_instance(instance)
-    return str(inner["instanceId"])
-
-
-def nodes_used_in_instance(instance: dict[str, Any]) -> int:
-    inner = unwrap_instance(instance)
-    return len(inner["shardAssignments"]["nodeToRunner"])
-
-
-def runner_ids_from_instance(instance: dict[str, Any]) -> list[str]:
-    inner = unwrap_instance(instance)
-    runner_to_shard = inner["shardAssignments"]["runnerToShard"]
-    return list(runner_to_shard.keys())
-
-
-def runner_ready(runner: dict[str, Any]) -> bool:
-    return "RunnerReady" in runner
-
-
-def runner_failed(runner: dict[str, Any]) -> bool:
-    return "RunnerFailed" in runner
-
-
-def get_runner_failed_message(runner: dict[str, Any]) -> str | None:
-    if "RunnerFailed" in runner:
-        return runner["RunnerFailed"].get("errorMessage")
-    return None
-
-
-def wait_for_instance_ready(
-    client: ExoClient, instance_id: str, timeout: float = 24000.0
-) -> None:
-    start_time = time.time()
-    instance_existed = False
-    while time.time() - start_time < timeout:
-        state = client.request_json("GET", "/state")
-        instances = state.get("instances", {})
-
-        if instance_id not in instances:
-            if instance_existed:
-                # Instance was deleted after being created - likely due to runner failure
-                raise RuntimeError(
-                    f"Instance {instance_id} was deleted (runner may have failed)"
-                )
-            time.sleep(0.1)
-            continue
-
-        instance_existed = True
-        instance = instances[instance_id]
-        runner_ids = runner_ids_from_instance(instance)
-        runners = state.get("runners", {})
-
-        # Check for failed runners first
-        for rid in runner_ids:
-            runner = runners.get(rid, {})
-            if runner_failed(runner):
-                error_msg = get_runner_failed_message(runner) or "Unknown error"
-                raise RuntimeError(f"Runner {rid} failed: {error_msg}")
-
-        if all(runner_ready(runners.get(rid, {})) for rid in runner_ids):
-            return
-
-        time.sleep(0.1)
-
-    raise TimeoutError(f"Instance {instance_id} did not become ready within {timeout=}")
-
-
-def wait_for_instance_gone(
-    client: ExoClient, instance_id: str, timeout: float = 3.0
-) -> None:
-    start_time = time.time()
-    while time.time() - start_time < timeout:
-        try:
-            client.request_json("GET", f"/instance/{instance_id}")
-            time.sleep(0.4)
-        except ExoHttpError as e:
-            if e.status == 404:
-                return
-
-    raise TimeoutError(f"Instance {instance_id} did not get deleted within {timeout=}")
-
-
-def format_peak_memory(b: float) -> str:
-    for unit in ["B", "KB", "MB", "GB", "TB"]:
-        if b < 1024.0:
-            return f"{b:.2f}{unit}"
-        b /= 1024.0
-    raise ValueError("You're using petabytes of memory. Something went wrong...")
-
-
-def parse_int_list(values: list[str]) -> list[int]:
-    items: list[int] = []
-    for v in values:
-        for part in v.split(","):
-            part = part.strip()
-            if part:
-                items.append(int(part))
-
-    seen: set[int] = set()
-    out: list[int] = []
-    for x in items:
-        if x not in seen:
-            out.append(x)
-            seen.add(x)
-    return out
-
-
-def resolve_model_short_id(client: ExoClient, model_arg: str) -> tuple[str, str]:
-    models = client.request_json("GET", "/models") or {}
-    data = models.get("data") or []
-
-    for m in data:
-        if m.get("id") == model_arg:
-            short_id = str(m["id"])
-            full_id = str(m.get("hugging_face_id") or m["id"])
-            return short_id, full_id
-
-    for m in data:
-        if m.get("hugging_face_id") == model_arg:
-            short_id = str(m["id"])
-            full_id = str(m["hugging_face_id"])
-            return short_id, full_id
-
-    raise ValueError(f"Model not found in /models: {model_arg}")
-
-
-def placement_filter(instance_meta: str, wanted: str) -> bool:
-    s = (instance_meta or "").lower()
-    if wanted == "both":
-        return ("ring" in s) or ("jaccl" in s)
-    return wanted in s
-
-
-def sharding_filter(sharding: str, wanted: str) -> bool:
-    s = (sharding or "").lower()
-    if wanted == "both":
-        return ("pipeline" in s) or ("tensor" in s)
-    return wanted in s
-
-
-def run_one_completion(
-    client: ExoClient, model_id: str, pp_hint: int, tg: int, prompt_sizer: PromptSizer
-) -> tuple[dict[str, Any], int]:
-    content, pp_tokens = prompt_sizer.build(pp_hint)
-    payload: dict[str, Any] = {
-        "model": model_id,
-        "messages": [{"role": "user", "content": content}],
-        "stream": False,
-        "max_tokens": tg,
-    }
-
-    t0 = time.perf_counter()
-    out = client.post_bench_chat_completions(payload)
-    elapsed = time.perf_counter() - t0
-
-    stats = out.get("generation_stats")
-
-    preview = (out.get("choices") or [{}])[0]["message"]["content"][:200]
-
-    return {
-        "elapsed_s": elapsed,
-        "output_text_preview": preview,
-        "stats": stats,
-    }, pp_tokens
-
-
-class PromptSizer:
-    def __init__(self, tokenizer: Any, atom: str = "a "):
-        self.tokenizer = tokenizer
-        self.atom = atom
-        self.count_fn = PromptSizer._make_counter(tokenizer)
-        self.base_tokens = self.count_fn("")
-
-    @staticmethod
-    def _make_counter(tokenizer: Any) -> Callable[[str], int]:
-        def count_fn(user_content: str) -> int:
-            messages = [{"role": "user", "content": user_content}]
-            ids = tokenizer.apply_chat_template(
-                messages, tokenize=True, add_generation_prompt=True
-            )
-            # Fix for transformers 5.x
-            if hasattr(ids, "input_ids"):
-                ids = ids.input_ids
-            return int(len(ids))
-
-        return count_fn
-
-    def build(self, target_prompt_tokens: int) -> tuple[str, int]:
-        target = int(target_prompt_tokens)
-        if target < self.base_tokens:
-            raise RuntimeError(
-                f"Target ({target}) is smaller than template overhead ({self.base_tokens})."
-            )
-
-        content = ""
-        tok = self.count_fn(content)
-
-        while tok < target:
-            content += self.atom
-            tok = self.count_fn(content)
-
-        if tok != target:
-            raise RuntimeError(
-                f"Overshot: got {tok} tokens (target {target}). "
-                f"Pick a different atom (try ' a' or '\\n' or '0 ')."
-            )
-
-        return content, tok
-
-
-def main() -> int:
-    ap = argparse.ArgumentParser(
-        prog="exo-bench",
-        description="Benchmark exo model throughput across placement previews.",
-    )
-    ap.add_argument("--host", default=os.environ.get("EXO_HOST", "localhost"))
-    ap.add_argument(
-        "--port", type=int, default=int(os.environ.get("EXO_PORT", "52415"))
-    )
-    ap.add_argument("--model", required=True, help="Model short id or huggingface id")
-    ap.add_argument(
-        "--pp",
-        nargs="+",
-        required=True,
-        help="Prompt-size hints (ints). Accepts commas.",
-    )
-    ap.add_argument(
-        "--tg",
-        nargs="+",
-        required=True,
-        help="Generation lengths (ints). Accepts commas.",
-    )
-    ap.add_argument(
-        "--max-nodes",
-        type=int,
-        default=4,
-        help="Only consider placements using <= this many nodes.",
-    )
-    ap.add_argument(
-        "--min-nodes",
-        type=int,
-        default=1,
-        help="Only consider placements using >= this many nodes.",
-    )
-    ap.add_argument(
-        "--instance-meta", choices=["ring", "jaccl", "both"], default="both"
-    )
-    ap.add_argument(
-        "--sharding", choices=["pipeline", "tensor", "both"], default="both"
-    )
-    ap.add_argument(
-        "--skip-pipeline-jaccl",
-        action="store_true",
-        help="Pipeline jaccl is often pointless, skip by default",
-    )
-    ap.add_argument(
-        "--repeat", type=int, default=1, help="Repetitions per (pp,tg) pair."
-    )
-    ap.add_argument(
-        "--warmup",
-        type=int,
-        default=0,
-        help="Warmup runs per placement (uses first pp/tg).",
-    )
-    ap.add_argument(
-        "--timeout", type=float, default=600.0, help="HTTP timeout (seconds)."
-    )
-    ap.add_argument(
-        "--json-out",
-        default="bench/results.json",
-        help="Write raw per-run results JSON to this path.",
-    )
-    ap.add_argument(
-        "--dry-run", action="store_true", help="List selected placements and exit."
-    )
-    args = ap.parse_args()
-
-    pp_list = parse_int_list(args.pp)
-    tg_list = parse_int_list(args.tg)
-    if not pp_list or not tg_list:
-        logger.error("pp and tg lists must be non-empty")
-        return 2
-    if args.repeat <= 0:
-        logger.error("--repeat must be >= 1")
-        return 2
-
-    client = ExoClient(args.host, args.port, timeout_s=args.timeout)
-    short_id, full_model_id = resolve_model_short_id(client, args.model)
-
-    previews_resp = client.request_json(
-        "GET", "/instance/previews", params={"model_id": short_id}
-    )
-    previews = previews_resp.get("previews") or []
-
-    tokenizer = AutoTokenizer.from_pretrained(
-        full_model_id,
-        trust_remote_code=True,
-    )
-    if tokenizer is None:
-        raise RuntimeError("[exo-bench] tokenizer load failed")
-
-    try:
-        prompt_sizer = PromptSizer(tokenizer)
-        logger.debug(f"[exo-bench] loaded tokenizer: {full_model_id} for prompt sizer")
-    except Exception:
-        logger.error("[exo-bench] tokenizer usable but prompt sizing failed")
-        raise
-
-    selected: list[dict[str, Any]] = []
-    for p in previews:
-        if p.get("error") is not None:
-            continue
-        if not placement_filter(str(p.get("instance_meta", "")), args.instance_meta):
-            continue
-        if not sharding_filter(str(p.get("sharding", "")), args.sharding):
-            continue
-
-        instance = p.get("instance")
-        if not isinstance(instance, dict):
-            continue
-
-        n = nodes_used_in_instance(instance)
-        # Skip tensor ring single node as it is pointless when pipeline ring
-        if n == 1 and (
-            (args.sharding == "both" and "tensor" in p.get("sharding", "").lower())
-            or (
-                args.instance_meta == "both"
-                and "jaccl" in p.get("instance_meta", "").lower()
-            )
-        ):
-            continue
-
-        if (
-            args.skip_pipeline_jaccl
-            and (
-                args.instance_meta == "both"
-                and "jaccl" in p.get("instance_meta", "").lower()
-            )
-            and (
-                args.sharding == "both" and "pipeline" in p.get("sharding", "").lower()
-            )
-        ):
-            continue
-
-        if args.min_nodes <= n <= args.max_nodes:
-            selected.append(p)
-
-    if not selected:
-        logger.error("No valid placements matched your filters.")
-        return 1
-
-    selected.sort(
-        key=lambda p: (
-            str(p.get("instance_meta", "")),
-            str(p.get("sharding", "")),
-            -nodes_used_in_instance(p["instance"]),
-        ),
-        reverse=True,
-    )
-
-    logger.debug(f"exo-bench model: short_id={short_id} full_id={full_model_id}")
-    logger.info(f"placements: {len(selected)}")
-    for p in selected:
-        logger.info(
-            f"  - {p['sharding']} / {p['instance_meta']} / nodes={nodes_used_in_instance(p['instance'])}"
-        )
-
-    if args.dry_run:
-        return 0
-
-    all_rows: list[dict[str, Any]] = []
-
-    for preview in selected:
-        instance = preview["instance"]
-        instance_id = instance_id_from_instance(instance)
-
-        sharding = str(preview["sharding"])
-        instance_meta = str(preview["instance_meta"])
-        n_nodes = nodes_used_in_instance(instance)
-
-        logger.info("=" * 80)
-        logger.info(
-            f"PLACEMENT: {sharding} / {instance_meta} / nodes={n_nodes} / instance_id={instance_id}"
-        )
-
-        client.request_json("POST", "/instance", body={"instance": instance})
-        try:
-            wait_for_instance_ready(client, instance_id)
-        except (RuntimeError, TimeoutError) as e:
-            logger.error(f"Failed to initialize placement: {e}")
-            with contextlib.suppress(ExoHttpError):
-                client.request_json("DELETE", f"/instance/{instance_id}")
-            continue
-
-        time.sleep(1)
-
-        try:
-            for i in range(args.warmup):
-                run_one_completion(
-                    client, full_model_id, pp_list[0], tg_list[0], prompt_sizer
-                )
-                logger.debug(f"  warmup {i + 1}/{args.warmup} done")
-
-            for pp in pp_list:
-                if (
-                    pp * n_nodes > 2048
-                    and "ring" in instance_meta.lower()
-                    and "tensor" in sharding.lower()
-                ):
-                    model_card = MODEL_CARDS[short_id]
-                    if model_card.metadata.storage_size > Memory.from_gb(10):
-                        logger.info(
-                            f"Skipping tensor ring as this is too slow for model of size {model_card.metadata.storage_size} on {n_nodes=}"
-                        )
-                        continue
-                for tg in tg_list:
-                    runs: list[dict[str, Any]] = []
-                    for r in range(args.repeat):
-                        time.sleep(3)
-                        try:
-                            row, actual_pp_tokens = run_one_completion(
-                                client, full_model_id, pp, tg, prompt_sizer
-                            )
-                        except Exception as e:
-                            logger.error(e)
-                            continue
-                        row.update(
-                            {
-                                "model_short_id": short_id,
-                                "model_id": full_model_id,
-                                "placement_sharding": sharding,
-                                "placement_instance_meta": instance_meta,
-                                "placement_nodes": n_nodes,
-                                "instance_id": instance_id,
-                                "pp_tokens": actual_pp_tokens,
-                                "tg": tg,
-                                "repeat_index": r,
-                            }
-                        )
-                        runs.append(row)
-                        all_rows.append(row)
-
-                    if runs:
-                        prompt_tps = mean(x["stats"]["prompt_tps"] for x in runs)
-                        gen_tps = mean(x["stats"]["generation_tps"] for x in runs)
-                        ptok = mean(x["stats"]["prompt_tokens"] for x in runs)
-                        gtok = mean(x["stats"]["generation_tokens"] for x in runs)
-                        peak = mean(
-                            x["stats"]["peak_memory_usage"]["inBytes"] for x in runs
-                        )
-
-                        logger.info(
-                            f"prompt_tps={prompt_tps:.2f} gen_tps={gen_tps:.2f}    "
-                            f"prompt_tokens={ptok} gen_tokens={gtok}    "
-                            f"peak_memory={format_peak_memory(peak)}\n"
-                        )
-                    time.sleep(2)
-        finally:
-            try:
-                client.request_json("DELETE", f"/instance/{instance_id}")
-            except ExoHttpError as e:
-                if e.status != 404:
-                    raise
-            wait_for_instance_gone(client, instance_id)
-            logger.debug(f"Deleted instance {instance_id}")
-
-            time.sleep(5)
-
-    if args.json_out:
-        with open(args.json_out, "w", encoding="utf-8") as f:
-            json.dump(all_rows, f, indent=2, ensure_ascii=False)
-        logger.debug(f"\nWrote results JSON: {args.json_out}")
-
-    return 0
-
-
-if __name__ == "__main__":
-    raise SystemExit(main())
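The removed `PromptSizer` builds a prompt with an exact token count. It appends an atom (`"a "` by default) until the chat-templated token count reaches the target, and fails if a step overshoots. Below is a minimal sketch of that grow-until-target loop. It is not the removed code itself. It assumes a hypothetical `toy_count` function (fixed template overhead plus one token per word) in place of a Hugging Face tokenizer.

```python
from collections.abc import Callable


def build_prompt(
    count_tokens: Callable[[str], int], target: int, atom: str = "a "
) -> tuple[str, int]:
    """Append `atom` until count_tokens(content) equals `target` exactly."""
    base = count_tokens("")  # template overhead for an empty user message
    if target < base:
        raise ValueError(f"target {target} is below template overhead {base}")

    content = ""
    tok = base
    while tok < target:
        content += atom
        tok = count_tokens(content)

    if tok != target:
        # An atom that adds more than one token per step can jump past the target.
        raise ValueError(f"overshot: got {tok} tokens for target {target}")
    return content, tok


def toy_count(user_content: str) -> int:
    # Hypothetical tokenizer: 5 tokens of chat-template overhead plus one per word.
    return 5 + len(user_content.split())


if __name__ == "__main__":
    text, n = build_prompt(toy_count, 12)
    print(n, repr(text))
```

In the removed script, the counter called `tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)`. The overhead therefore came from the model's own chat template rather than from a constant, and the overshoot error suggested trying a different atom.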
--- a/dashboard/dashboard.nix
+++ b/dashboard/dashboard.nix
@@ -1,60 +0,0 @@
-{ lib
-, config
-, dream2nix
-, ...
-}:
-let
-  # Read and parse the lock file
-  rawLockFile = builtins.fromJSON (builtins.readFile "${config.deps.dashboardSrc}/package-lock.json");
-
-  # For packages with bundleDependencies, filter out deps that are bundled
-  # (bundled deps are inside the tarball, not separate lockfile entries)
-  fixedPackages = lib.mapAttrs
-    (path: entry:
-      if entry ? bundleDependencies && entry.bundleDependencies != [ ]
-      then entry // {
-        dependencies = lib.filterAttrs
-          (name: _: !(lib.elem name entry.bundleDependencies))
-          (entry.dependencies or { });
-      }
-      else entry
-    )
-    (rawLockFile.packages or { });
-
-  fixedLockFile = rawLockFile // { packages = fixedPackages; };
-in
-{
-  imports = [
-    dream2nix.modules.dream2nix.nodejs-package-lock-v3
-    dream2nix.modules.dream2nix.nodejs-granular-v3
-  ];
-
-  name = "exo-dashboard";
-  version = "1.0.0";
-
-  mkDerivation = {
-    src = config.deps.dashboardSrc;
-
-    buildPhase = ''
-      runHook preBuild
-      npm run build
-      runHook postBuild
-    '';
-
-    installPhase = ''
-      runHook preInstall
-      cp -r build $out/build
-      runHook postInstall
-    '';
-  };
-
-  deps = { nixpkgs, ... }: {
-    inherit (nixpkgs) stdenv;
-    dashboardSrc = null; # Injected by parts.nix
-  };
-
-  nodejs-package-lock-v3 = {
-    # Don't use packageLockFile - provide the fixed lock content directly
-    packageLock = fixedLockFile;
-  };
-}
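Before passing `package-lock.json` to dream2nix, the removed `dashboard.nix` rewrote it. For every package entry with a non-empty `bundleDependencies` list, it dropped those names from that entry's `dependencies`, because bundled packages ship inside the tarball and have no separate lockfile entries. Below is a sketch of the same transformation over an already-parsed lockfile dict. Like the Nix expression, it handles only the list form of `bundleDependencies`. The package names in the usage example are hypothetical.

```python
from typing import Any


def drop_bundled_deps(lock: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `lock` whose package entries omit bundled dependencies."""
    fixed: dict[str, Any] = {}
    for path, entry in lock.get("packages", {}).items():
        bundled = entry.get("bundleDependencies")
        # Only the list form is handled, mirroring the Nix filter.
        if isinstance(bundled, list) and bundled:
            deps = entry.get("dependencies", {})
            entry = {
                **entry,
                "dependencies": {
                    name: ver for name, ver in deps.items() if name not in bundled
                },
            }
        fixed[path] = entry
    # Top-level keys such as lockfileVersion are carried over unchanged.
    return {**lock, "packages": fixed}


if __name__ == "__main__":
    lock = {
        "lockfileVersion": 3,
        "packages": {
            "node_modules/foo": {
                "bundleDependencies": ["bar"],
                "dependencies": {"bar": "^2.0.0", "baz": "^1.0.0"},
            }
        },
    }
    print(drop_bundled_deps(lock)["packages"]["node_modules/foo"]["dependencies"])
```

The sketch returns a new dict instead of mutating its input. This mirrors the Nix version, where `rawLockFile // { packages = fixedPackages; }` builds a new attribute set.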
--- a/dashboard/package-lock.json
+++ b/dashboard/package-lock.json
@@ -863,7 +863,6 @@
 			"integrity": "sha512-oH8tXw7EZnie8FdOWYrF7Yn4IKrqTFHhXvl8YxXxbKwTMcD/5NNCryUSEXRk2ZR4ojnub0P8rNrsVGHXWqIDtA==",
 			"dev": true,
 			"license": "MIT",
-			"peer": true,
 			"dependencies": {
 				"@standard-schema/spec": "^1.0.0",
 				"@sveltejs/acorn-typescript": "^1.0.5",
@@ -903,7 +902,6 @@
 			"integrity": "sha512-Y1Cs7hhTc+a5E9Va/xwKlAJoariQyHY+5zBgCZg4PFWNYQ1nMN9sjK1zhw1gK69DuqVP++sht/1GZg1aRwmAXQ==",
 			"dev": true,
 			"license": "MIT",
-			"peer": true,
 			"dependencies": {
 				"@sveltejs/vite-plugin-svelte-inspector": "^4.0.1",
 				"debug": "^4.4.1",
@@ -1520,7 +1518,6 @@
 			"integrity": "sha512-LCCV0HdSZZZb34qifBsyWlUmok6W7ouER+oQIGBScS8EsZsQbrtFTUrDX4hOl+CS6p7cnNC4td+qrSVGSCTUfQ==",
 			"dev": true,
 			"license": "MIT",
-			"peer": true,
 			"dependencies": {
 				"undici-types": "~6.21.0"
 			}
@@ -1530,7 +1527,6 @@
 			"resolved": "https://registry.npmjs.org/acorn/-/acorn-8.15.0.tgz",
 			"integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==",
 			"license": "MIT",
-			"peer": true,
 			"bin": {
 				"acorn": "bin/acorn"
 			},
@@ -1943,7 +1939,6 @@
 			"integrity": "sha512-fmTRWbNMmsmWq6xJV8D19U/gw/bwrHfNXxrIN+HfZgnzqTHp9jOmKMhsTUjXOJnZOdZY9Q28y4yebKzqDKlxlQ==",
 			"dev": true,
 			"license": "ISC",
-			"peer": true,
 			"engines": {
 				"node": ">=12"
 			}
@@ -2651,7 +2646,6 @@
 			"integrity": "sha512-5gTmgEY/sqK6gFXLIsQNH19lWb4ebPDLA4SdLP7dsWkIXHWlG66oPuVvXSGFPppYZz8ZDZq0dYYrbHfBCVUb1Q==",
 			"dev": true,
 			"license": "MIT",
-			"peer": true,
 			"engines": {
 				"node": ">=12"
 			},
@@ -2839,7 +2833,6 @@
 			"resolved": "https://registry.npmjs.org/svelte/-/svelte-5.45.3.tgz",
 			"integrity": "sha512-ngKXNhNvwPzF43QqEhDOue7TQTrG09em1sd4HBxVF0Wr2gopAmdEWan+rgbdgK4fhBtSOTJO8bYU4chUG7VXZQ==",
 			"license": "MIT",
-			"peer": true,
 			"dependencies": {
 				"@jridgewell/remapping": "^2.3.4",
 				"@jridgewell/sourcemap-codec": "^1.5.0",
@@ -2984,7 +2977,6 @@
 			"integrity": "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw==",
 			"dev": true,
 			"license": "Apache-2.0",
-			"peer": true,
 			"bin": {
 				"tsc": "bin/tsc",
 				"tsserver": "bin/tsserver"
@@ -3006,7 +2998,6 @@
 			"integrity": "sha512-+Oxm7q9hDoLMyJOYfUYBuHQo+dkAloi33apOPP56pzj+vsdJDzr+j1NISE5pyaAuKL4A3UD34qd0lx5+kfKp2g==",
 			"dev": true,
 			"license": "MIT",
-			"peer": true,
 			"dependencies": {
 				"esbuild": "^0.25.0",
 				"fdir": "^6.4.4",
--- a/dashboard/parts.nix
+++ b/dashboard/parts.nix
@@ -1,44 +0,0 @@
-{ inputs, ... }:
-{
-  perSystem =
-    { pkgs, lib, ... }:
-    let
-      # Filter source to only include dashboard directory
-      src = lib.cleanSourceWith {
-        src = inputs.self;
-        filter =
-          path: type:
-          let
-            baseName = builtins.baseNameOf path;
-            inDashboardDir =
-              (lib.hasInfix "/dashboard/" path)
-              || (lib.hasSuffix "/dashboard" (builtins.dirOf path))
-              || (baseName == "dashboard" && type == "directory");
-          in
-          inDashboardDir;
-      };
-
-      # Build the dashboard with dream2nix (includes node_modules in output)
-      dashboardFull = inputs.dream2nix.lib.evalModules {
-        packageSets.nixpkgs = pkgs;
-        modules = [
-          ./dashboard.nix
-          {
-            paths.projectRoot = inputs.self;
-            paths.projectRootFile = "flake.nix";
-            paths.package = inputs.self + "/dashboard";
-          }
-          # Inject the filtered source
-          {
-            deps.dashboardSrc = lib.mkForce "${src}/dashboard";
-          }
-        ];
-      };
-    in
-    {
-      # Extract just the static site from the full build
-      packages.dashboard = pkgs.runCommand "exo-dashboard" { } ''
-        cp -r ${dashboardFull}/build $out
-      '';
-    };
-}
--- a/dashboard/src/app.d.ts
+++ b/dashboard/src/app.d.ts
@@ -11,3 +11,4 @@ declare global {
 }

 export {};
+
--- a/dashboard/src/lib/components/ChatForm.svelte
+++ b/dashboard/src/lib/components/ChatForm.svelte
@@ -60,39 +60,12 @@
 		return models;
 	});

-	// Track previous model IDs to detect newly added models (plain variable to avoid reactive loop)
-	let previousModelIds: Set<string> = new Set();
-
-	// Auto-select the first available model if none is selected, if current selection is stale, or if a new model is added
+	// Auto-select the first available model if none is selected
 	$effect(() => {
 		const models = availableModels();
-		const currentModelIds = new Set(models.map(m => m.id));
-
-		if (models.length > 0) {
-			// Find newly added models (in current but not in previous)
-			const newModels = models.filter(m => !previousModelIds.has(m.id));
-
-			// If no model selected, select the first available
-			if (!currentModel) {
-				setSelectedChatModel(models[0].id);
-			}
-			// If current model is stale (no longer has a running instance), reset to first available
-			else if (!models.some(m => m.id === currentModel)) {
-				setSelectedChatModel(models[0].id);
-			}
-			// If a new model was just added, select it
-			else if (newModels.length > 0 && previousModelIds.size > 0) {
-				setSelectedChatModel(newModels[0].id);
-			}
-		} else {
-			// No instances running - clear the selected model
-			if (currentModel) {
-				setSelectedChatModel('');
-			}
+		if (models.length > 0 && !currentModel) {
+			setSelectedChatModel(models[0].id);
 		}
-
-		// Update previous model IDs for next comparison
-		previousModelIds = currentModelIds;
 	});

 	function getInstanceModelId(instanceWrapped: unknown): string {
--- a/dashboard/src/lib/components/ChatMessages.svelte
+++ b/dashboard/src/lib/components/ChatMessages.svelte
@@ -1,16 +1,14 @@
 <script lang="ts">
-	import {
-		messages,
-		currentResponse,
+	import { 
+		messages, 
+		currentResponse, 
 		isLoading,
 		deleteMessage,
 		editAndRegenerate,
-		regenerateLastResponse,
-		regenerateFromToken
+		regenerateLastResponse
 	} from '$lib/stores/app.svelte';
 	import type { MessageAttachment } from '$lib/stores/app.svelte';
 	import MarkdownContent from './MarkdownContent.svelte';
-	import TokenHeatmap from './TokenHeatmap.svelte';

 	interface Props {
 		class?: string;
@@ -97,23 +95,6 @@
 let copiedMessageId = $state<string | null>(null);
 let expandedThinkingMessageIds = $state<Set<string>>(new Set());

-// Uncertainty view state - tracks which messages show token heatmap
-let uncertaintyViewMessageIds = $state<Set<string>>(new Set());
-
-function toggleUncertaintyView(messageId: string) {
-	const newSet = new Set(uncertaintyViewMessageIds);
-	if (newSet.has(messageId)) {
-		newSet.delete(messageId);
-	} else {
-		newSet.add(messageId);
-	}
-	uncertaintyViewMessageIds = newSet;
-}
-
-function isUncertaintyViewEnabled(messageId: string): boolean {
-	return uncertaintyViewMessageIds.has(messageId);
-}
-
 	function formatTimestamp(timestamp: number): string {
 		return new Date(timestamp).toLocaleTimeString('en-US', { 
 			hour12: false,
@@ -385,17 +366,7 @@ function isThinkingExpanded(messageId: string): boolean {
 									</div>
 								{/if}
 								<div class="text-xs text-foreground">
-									{#if message.role === 'assistant' && isUncertaintyViewEnabled(message.id) && message.tokens && message.tokens.length > 0}
-										<!-- Uncertainty heatmap view -->
-										<TokenHeatmap
-											tokens={message.tokens}
-											isGenerating={loading}
-											onRegenerateFrom={(tokenIndex) => regenerateFromToken(message.id, tokenIndex)}
-										/>
-									{:else}
-										<!-- Normal markdown view -->
-										<MarkdownContent content={message.content || (loading ? response : '')} />
-									{/if}
+									<MarkdownContent content={message.content || (loading ? response : '')} />
 									{#if loading && !message.content}
 										<span class="inline-block w-2 h-4 bg-exo-yellow/70 ml-1 cursor-blink"></span>
 									{/if}
@@ -448,20 +419,7 @@ function isThinkingExpanded(messageId: string): boolean {
 								</svg>
 							</button>
 						{/if}
-
-						<!-- Uncertainty view toggle (assistant messages with tokens only) -->
-						{#if message.role === 'assistant' && message.tokens && message.tokens.length > 0}
-							<button
-								onclick={() => toggleUncertaintyView(message.id)}
-								class="p-1.5 transition-colors rounded cursor-pointer {isUncertaintyViewEnabled(message.id) ? 'text-exo-yellow' : 'text-exo-light-gray hover:text-exo-yellow'}"
-								title={isUncertaintyViewEnabled(message.id) ? 'Hide uncertainty' : 'Show uncertainty'}
-							>
-								<svg class="w-3.5 h-3.5" fill="none" viewBox="0 0 24 24" stroke="currentColor">
-									<path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M9 19v-6a2 2 0 00-2-2H5a2 2 0 00-2 2v6a2 2 0 002 2h2a2 2 0 002-2zm0 0V9a2 2 0 012-2h2a2 2 0 012 2v10m-6 0a2 2 0 002 2h2a2 2 0 002-2m0 0V5a2 2 0 012-2h2a2 2 0 012 2v14a2 2 0 01-2 2h-2a2 2 0 01-2-2z" />
-								</svg>
-							</button>
-						{/if}
-
+						
 						<!-- Delete button -->
 						<button
 							onclick={() => handleDeleteClick(message.id)}
--- a/dashboard/src/lib/components/ModelCard.svelte
+++ b/dashboard/src/lib/components/ModelCard.svelte
@@ -445,12 +445,6 @@ function toggleNodeDetails(nodeId: string): void {
 								<feMergeNode in="SourceGraphic"/>
 							</feMerge>
 						</filter>
-						
-						<!-- Strong glow for new memory -->
-						<filter id="memGlow-{filterId}" x="-100%" y="-100%" width="300%" height="300%">
-							<feGaussianBlur stdDeviation="3" result="blur"/>
-							<feComposite in="SourceGraphic" in2="blur" operator="over"/>
-						</filter>
 					</defs>
 					
 					<!-- Connection lines between nodes (if multiple) -->
@@ -558,7 +552,7 @@ function toggleNodeDetails(nodeId: string): void {
 										height={node.currentFillHeight}
 										fill="#374151"
 									/>
-									<!-- New model memory fill (glowing yellow) -->
+									<!-- New model memory fill (yellow) -->
 									{#if node.modelUsageGB > 0 && node.isUsed}
 										<rect 
 											x="4" 
@@ -566,8 +560,7 @@ function toggleNodeDetails(nodeId: string): void {
 											width={node.iconSize - 8} 
 											height={node.modelFillHeight}
 											fill="#FFD700"
-											filter="url(#memGlow-{filterId})"
-											class="animate-pulse-slow"
+											opacity="0.9"
 										/>
 									{/if}
 									<!-- Base/keyboard -->
@@ -611,8 +604,7 @@ function toggleNodeDetails(nodeId: string): void {
 											width={node.iconSize - 8} 
 											height={(node.iconSize - 8) * ((node.newPercent - node.currentPercent) / 100)}
 											fill="#FFD700"
-											filter="url(#memGlow-{filterId})"
-											class="animate-pulse-slow"
+											opacity="0.9"
 										/>
 									{/if}
 								</g>
@@ -649,8 +641,7 @@ function toggleNodeDetails(nodeId: string): void {
 											width={node.iconSize - 8} 
 											height={(node.iconSize * 0.36) * ((node.newPercent - node.currentPercent) / 100)}
 											fill="#FFD700"
-											filter="url(#memGlow-{filterId})"
-											class="animate-pulse-slow"
+											opacity="0.9"
 										/>
 									{/if}
 								</g>
@@ -709,11 +700,5 @@ function toggleNodeDetails(nodeId: string): void {
 </div>

 <style>
-	@keyframes pulse-slow {
-		0%, 100% { opacity: 0.8; }
-		50% { opacity: 1; }
-	}
-	.animate-pulse-slow {
-		animation: pulse-slow 1.5s ease-in-out infinite;
-	}
+	/* Styles removed - animations were causing GPU overhead */
 </style>
--- a/dashboard/src/lib/components/TokenHeatmap.svelte
+++ b/dashboard/src/lib/components/TokenHeatmap.svelte
@@ -1,192 +0,0 @@
-<script lang="ts">
-	import type { TokenData } from '$lib/stores/app.svelte';
-
-	interface Props {
-		tokens: TokenData[];
-		class?: string;
-		isGenerating?: boolean;
-		onRegenerateFrom?: (tokenIndex: number) => void;
-	}
-
-	let { tokens, class: className = '', isGenerating = false, onRegenerateFrom }: Props = $props();
-
-	// Tooltip state - track both token data and index
-	let hoveredTokenIndex = $state<number | null>(null);
-	let hoveredPosition = $state<{ x: number; y: number } | null>(null);
-	let isTooltipHovered = $state(false);
-	let hideTimeoutId: ReturnType<typeof setTimeout> | null = null;
-
-	// Derive the hovered token from the index (stable across re-renders)
-	const hoveredToken = $derived(
-		hoveredTokenIndex !== null && hoveredPosition && tokens[hoveredTokenIndex]
-			? { token: tokens[hoveredTokenIndex], index: hoveredTokenIndex, ...hoveredPosition }
-			: null
-	);
-
-	/**
-	 * Get confidence styling based on probability.
-	 * Following Apple design principles: high confidence tokens blend in,
-	 * only uncertainty draws attention.
-	 */
-	function getConfidenceClass(probability: number): string {
-		if (probability > 0.8) return 'text-inherit'; // Expected tokens - blend in
-		if (probability > 0.5) return 'bg-gray-500/10 text-inherit'; // Slight hint
-		if (probability > 0.2) return 'bg-amber-500/15 text-amber-200/90'; // Subtle warmth
-		return 'bg-red-500/20 text-red-200/90'; // Draws attention
-	}
-
-	/**
-	 * Get border/underline styling for uncertain tokens
-	 */
-	function getBorderClass(probability: number): string {
-		if (probability > 0.8) return 'border-transparent'; // No border for expected
-		if (probability > 0.5) return 'border-gray-500/20';
-		if (probability > 0.2) return 'border-amber-500/30';
-		return 'border-red-500/40';
-	}
-
-	function clearHideTimeout() {
-		if (hideTimeoutId) {
-			clearTimeout(hideTimeoutId);
-			hideTimeoutId = null;
-		}
-	}
-
-	function handleMouseEnter(event: MouseEvent, token: TokenData, index: number) {
-		clearHideTimeout();
-		const rect = (event.target as HTMLElement).getBoundingClientRect();
-		hoveredTokenIndex = index;
-		hoveredPosition = {
-			x: rect.left + rect.width / 2,
-			y: rect.top - 10
-		};
-	}
-
-	function handleMouseLeave() {
-		clearHideTimeout();
-		// Use longer delay during generation to account for re-renders
-		const delay = isGenerating ? 300 : 100;
-		hideTimeoutId = setTimeout(() => {
-			if (!isTooltipHovered) {
-				hoveredTokenIndex = null;
-				hoveredPosition = null;
-			}
-		}, delay);
-	}
-
-	function handleTooltipEnter() {
-		clearHideTimeout();
-		isTooltipHovered = true;
-	}
-
-	function handleTooltipLeave() {
-		isTooltipHovered = false;
-		hoveredTokenIndex = null;
-		hoveredPosition = null;
-	}
-
-	function handleRegenerate() {
-		if (hoveredToken && onRegenerateFrom) {
-			const indexToRegenerate = hoveredToken.index;
-			// Clear hover state immediately
-			hoveredTokenIndex = null;
-			hoveredPosition = null;
-			isTooltipHovered = false;
-			// Call regenerate
-			onRegenerateFrom(indexToRegenerate);
-		}
-	}
-
-	function formatProbability(prob: number): string {
-		return (prob * 100).toFixed(1) + '%';
-	}
-
-	function formatLogprob(logprob: number): string {
-		return logprob.toFixed(3);
-	}
-
-	function getProbabilityColor(probability: number): string {
-		if (probability > 0.8) return 'text-gray-300';
-		if (probability > 0.5) return 'text-gray-400';
-		if (probability > 0.2) return 'text-amber-400';
-		return 'text-red-400';
-	}
-</script>
-
-<div class="token-heatmap leading-relaxed {className}">
-	{#each tokens as tokenData, i (i)}
-		<span
-			role="button"
-			tabindex="0"
-			class="token-span inline rounded px-0.5 py-0.5 cursor-pointer transition-all duration-150 border {getConfidenceClass(tokenData.probability)} {getBorderClass(tokenData.probability)} hover:opacity-80"
-			onmouseenter={(e) => handleMouseEnter(e, tokenData, i)}
-			onmouseleave={handleMouseLeave}
-		>{tokenData.token}</span>
-	{/each}
-</div>
-
-<!-- Tooltip -->
-{#if hoveredToken}
-	<div
-		class="fixed z-50"
-		style="left: {hoveredToken.x}px; top: {hoveredToken.y}px; transform: translate(-50%, -100%);"
-		onmouseenter={handleTooltipEnter}
-		onmouseleave={handleTooltipLeave}
-	>
-		<div class="bg-gray-900/95 backdrop-blur-sm border border-gray-700/50 rounded-xl shadow-xl p-3 text-sm min-w-48">
-			<!-- Token info -->
-			<div class="mb-2">
-				<span class="text-gray-500 text-xs">Token:</span>
-				<span class="text-white font-mono ml-1">"{hoveredToken.token.token}"</span>
-				<span class="{getProbabilityColor(hoveredToken.token.probability)} ml-2">{formatProbability(hoveredToken.token.probability)}</span>
-			</div>
-
-			<div class="text-gray-400 text-xs mb-1">
-				logprob: <span class="text-gray-300 font-mono">{formatLogprob(hoveredToken.token.logprob)}</span>
-			</div>
-
-			<!-- Top alternatives -->
-			{#if hoveredToken.token.topLogprobs.length > 0}
-				<div class="border-t border-gray-700/50 mt-2 pt-2">
-					<div class="text-gray-500 text-xs mb-1">Alternatives:</div>
-					{#each hoveredToken.token.topLogprobs.slice(0, 5) as alt, idx (idx)}
-						{@const altProb = Math.exp(alt.logprob)}
-						<div class="flex justify-between items-center text-xs py-0.5">
-							<span class="text-gray-300 font-mono truncate max-w-24">"{alt.token}"</span>
-							<span class="text-gray-400 ml-2">{formatProbability(altProb)}</span>
-						</div>
-					{/each}
-				</div>
-			{/if}
-
-			<!-- Regenerate button -->
-			{#if onRegenerateFrom}
-				<button
-					onclick={handleRegenerate}
-					class="w-full mt-2 pt-2 border-t border-gray-700/50 flex items-center justify-center gap-1.5 text-xs text-gray-400 hover:text-white transition-colors cursor-pointer"
-				>
-					<svg class="w-3 h-3" fill="none" viewBox="0 0 24 24" stroke="currentColor">
-						<path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M4 4v5h.582m15.356 2A8.001 8.001 0 004.582 9m0 0H9m11 11v-5h-.581m0 0a8.003 8.003 0 01-15.357-2m15.357 2H15" />
-					</svg>
-					Regenerate from here
-				</button>
-			{/if}
-		</div>
-		<!-- Arrow -->
-		<div class="absolute left-1/2 -translate-x-1/2 top-full">
-			<div class="border-8 border-transparent border-t-gray-900"></div>
-		</div>
-	</div>
-{/if}
-
-<style>
-	.token-heatmap {
-		word-wrap: break-word;
-		white-space: pre-wrap;
-	}
-
-	.token-span {
-		margin: 0;
-		border-width: 1px;
-	}
-</style>
--- a/dashboard/src/lib/components/TopologyGraph.svelte
+++ b/dashboard/src/lib/components/TopologyGraph.svelte
--- a/dashboard/src/lib/components/index.ts
+++ b/dashboard/src/lib/components/index.ts
@@ -1,7 +1,8 @@
-export { default as TopologyGraph } from "./TopologyGraph.svelte";
-export { default as ChatForm } from "./ChatForm.svelte";
-export { default as ChatMessages } from "./ChatMessages.svelte";
-export { default as ChatAttachments } from "./ChatAttachments.svelte";
-export { default as ChatSidebar } from "./ChatSidebar.svelte";
-export { default as ModelCard } from "./ModelCard.svelte";
-export { default as MarkdownContent } from "./MarkdownContent.svelte";
+export { default as TopologyGraph } from './TopologyGraph.svelte';
+export { default as ChatForm } from './ChatForm.svelte';
+export { default as ChatMessages } from './ChatMessages.svelte';
+export { default as ChatAttachments } from './ChatAttachments.svelte';
+export { default as ChatSidebar } from './ChatSidebar.svelte';
+export { default as ModelCard } from './ModelCard.svelte';
+export { default as MarkdownContent } from './MarkdownContent.svelte';
+
--- a/dashboard/src/lib/stores/app.svelte.ts
+++ b/dashboard/src/lib/stores/app.svelte.ts
--- a/dashboard/src/lib/types/files.ts
+++ b/dashboard/src/lib/types/files.ts
@@ -13,124 +13,55 @@ export interface ChatUploadedFile {
 }

 export interface ChatAttachment {
-	type: "image" | "text" | "pdf" | "audio";
+	type: 'image' | 'text' | 'pdf' | 'audio';
 	name: string;
 	content?: string;
 	base64Url?: string;
 	mimeType?: string;
 }

-export type FileCategory = "image" | "text" | "pdf" | "audio" | "unknown";
+export type FileCategory = 'image' | 'text' | 'pdf' | 'audio' | 'unknown';

-export const IMAGE_EXTENSIONS = [
-	".jpg",
-	".jpeg",
-	".png",
-	".gif",
-	".webp",
-	".svg",
-];
-export const IMAGE_MIME_TYPES = [
-	"image/jpeg",
-	"image/png",
-	"image/gif",
-	"image/webp",
-	"image/svg+xml",
-];
+export const IMAGE_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg'];
+export const IMAGE_MIME_TYPES = ['image/jpeg', 'image/png', 'image/gif', 'image/webp', 'image/svg+xml'];

 export const TEXT_EXTENSIONS = [
-	".txt",
-	".md",
-	".json",
-	".xml",
-	".yaml",
-	".yml",
-	".csv",
-	".log",
-	".js",
-	".ts",
-	".jsx",
-	".tsx",
-	".py",
-	".java",
-	".cpp",
-	".c",
-	".h",
-	".css",
-	".html",
-	".htm",
-	".sql",
-	".sh",
-	".bat",
-	".rs",
-	".go",
-	".rb",
-	".php",
-	".swift",
-	".kt",
-	".scala",
-	".r",
-	".dart",
-	".vue",
-	".svelte",
+	'.txt', '.md', '.json', '.xml', '.yaml', '.yml', '.csv', '.log',
+	'.js', '.ts', '.jsx', '.tsx', '.py', '.java', '.cpp', '.c', '.h',
+	'.css', '.html', '.htm', '.sql', '.sh', '.bat', '.rs', '.go',
+	'.rb', '.php', '.swift', '.kt', '.scala', '.r', '.dart', '.vue', '.svelte'
 ];
 export const TEXT_MIME_TYPES = [
-	"text/plain",
-	"text/markdown",
-	"text/csv",
-	"text/html",
-	"text/css",
-	"application/json",
-	"application/xml",
-	"text/xml",
-	"application/javascript",
-	"text/javascript",
-	"application/typescript",
+	'text/plain', 'text/markdown', 'text/csv', 'text/html', 'text/css',
+	'application/json', 'application/xml', 'text/xml', 'application/javascript',
+	'text/javascript', 'application/typescript'
 ];

-export const PDF_EXTENSIONS = [".pdf"];
-export const PDF_MIME_TYPES = ["application/pdf"];
+export const PDF_EXTENSIONS = ['.pdf'];
+export const PDF_MIME_TYPES = ['application/pdf'];

-export const AUDIO_EXTENSIONS = [".mp3", ".wav", ".ogg", ".m4a"];
-export const AUDIO_MIME_TYPES = [
-	"audio/mpeg",
-	"audio/wav",
-	"audio/ogg",
-	"audio/mp4",
-];
+export const AUDIO_EXTENSIONS = ['.mp3', '.wav', '.ogg', '.m4a'];
+export const AUDIO_MIME_TYPES = ['audio/mpeg', 'audio/wav', 'audio/ogg', 'audio/mp4'];

 /**
 * Get file category based on MIME type and extension
 */
-export function getFileCategory(
-	mimeType: string,
-	fileName: string,
-): FileCategory {
-	const extension = fileName.toLowerCase().slice(fileName.lastIndexOf("."));
-
-	if (
-		IMAGE_MIME_TYPES.includes(mimeType) ||
-		IMAGE_EXTENSIONS.includes(extension)
-	) {
-		return "image";
+export function getFileCategory(mimeType: string, fileName: string): FileCategory {
+	const extension = fileName.toLowerCase().slice(fileName.lastIndexOf('.'));
+	
+	if (IMAGE_MIME_TYPES.includes(mimeType) || IMAGE_EXTENSIONS.includes(extension)) {
+		return 'image';
 	}
 	if (PDF_MIME_TYPES.includes(mimeType) || PDF_EXTENSIONS.includes(extension)) {
-		return "pdf";
+		return 'pdf';
 	}
-	if (
-		AUDIO_MIME_TYPES.includes(mimeType) ||
-		AUDIO_EXTENSIONS.includes(extension)
-	) {
-		return "audio";
+	if (AUDIO_MIME_TYPES.includes(mimeType) || AUDIO_EXTENSIONS.includes(extension)) {
+		return 'audio';
 	}
-	if (
-		TEXT_MIME_TYPES.includes(mimeType) ||
-		TEXT_EXTENSIONS.includes(extension) ||
-		mimeType.startsWith("text/")
-	) {
-		return "text";
+	if (TEXT_MIME_TYPES.includes(mimeType) || TEXT_EXTENSIONS.includes(extension) || mimeType.startsWith('text/')) {
+		return 'text';
 	}
-	return "unknown";
+	return 'unknown';
 }

 /**
@@ -138,36 +69,36 @@ export function getFileCategory(
 */
 export function getAcceptString(categories: FileCategory[]): string {
 	const accepts: string[] = [];
-
+	
 	for (const category of categories) {
 		switch (category) {
-			case "image":
+			case 'image':
 				accepts.push(...IMAGE_EXTENSIONS, ...IMAGE_MIME_TYPES);
 				break;
-			case "text":
+			case 'text':
 				accepts.push(...TEXT_EXTENSIONS, ...TEXT_MIME_TYPES);
 				break;
-			case "pdf":
+			case 'pdf':
 				accepts.push(...PDF_EXTENSIONS, ...PDF_MIME_TYPES);
 				break;
-			case "audio":
+			case 'audio':
 				accepts.push(...AUDIO_EXTENSIONS, ...AUDIO_MIME_TYPES);
 				break;
 		}
 	}
-
-	return accepts.join(",");
+	
+	return accepts.join(',');
 }

 /**
 * Format file size for display
 */
 export function formatFileSize(bytes: number): string {
-	if (bytes === 0) return "0 B";
+	if (bytes === 0) return '0 B';
 	const k = 1024;
-	const sizes = ["B", "KB", "MB", "GB"];
+	const sizes = ['B', 'KB', 'MB', 'GB'];
 	const i = Math.floor(Math.log(bytes) / Math.log(k));
-	return parseFloat((bytes / Math.pow(k, i)).toFixed(1)) + " " + sizes[i];
+	return parseFloat((bytes / Math.pow(k, i)).toFixed(1)) + ' ' + sizes[i];
 }

 /**
@@ -197,44 +128,42 @@ export function readFileAsText(file: File): Promise<string> {
 /**
 * Process uploaded files into ChatUploadedFile format
 */
-export async function processUploadedFiles(
-	files: File[],
-): Promise<ChatUploadedFile[]> {
+export async function processUploadedFiles(files: File[]): Promise<ChatUploadedFile[]> {
 	const results: ChatUploadedFile[] = [];
-
+	
 	for (const file of files) {
-		const id =
-			Date.now().toString() + Math.random().toString(36).substring(2, 9);
+		const id = Date.now().toString() + Math.random().toString(36).substring(2, 9);
 		const category = getFileCategory(file.type, file.name);
-
+		
 		const base: ChatUploadedFile = {
 			id,
 			name: file.name,
 			size: file.size,
 			type: file.type,
-			file,
+			file
 		};
-
+		
 		try {
-			if (category === "image") {
+			if (category === 'image') {
 				const preview = await readFileAsDataURL(file);
 				results.push({ ...base, preview });
-			} else if (category === "text" || category === "unknown") {
+			} else if (category === 'text' || category === 'unknown') {
 				const textContent = await readFileAsText(file);
 				results.push({ ...base, textContent });
-			} else if (category === "pdf") {
+			} else if (category === 'pdf') {
 				results.push(base);
-			} else if (category === "audio") {
+			} else if (category === 'audio') {
 				const preview = await readFileAsDataURL(file);
 				results.push({ ...base, preview });
 			} else {
 				results.push(base);
 			}
 		} catch (error) {
-			console.error("Error processing file:", file.name, error);
+			console.error('Error processing file:', file.name, error);
 			results.push(base);
 		}
 	}
-
+	
 	return results;
 }
+
--- a/dashboard/src/routes/+page.svelte
+++ b/dashboard/src/routes/+page.svelte
@@ -51,59 +51,6 @@ const sidebarVisible = $derived(chatSidebarVisible());
 	let selectedSharding = $state<'Pipeline' | 'Tensor'>('Pipeline');
 	type InstanceMeta = 'MlxRing' | 'MlxIbv' | 'MlxJaccl';
 	
-	// Launch defaults persistence
-	const LAUNCH_DEFAULTS_KEY = 'exo-launch-defaults';
-	interface LaunchDefaults {
-		modelId: string | null;
-		sharding: 'Pipeline' | 'Tensor';
-		instanceType: InstanceMeta;
-		minNodes: number;
-	}
-	
-	function saveLaunchDefaults(): void {
-		const defaults: LaunchDefaults = {
-			modelId: selectedPreviewModelId(),
-			sharding: selectedSharding,
-			instanceType: selectedInstanceType,
-			minNodes: selectedMinNodes,
-		};
-		try {
-			localStorage.setItem(LAUNCH_DEFAULTS_KEY, JSON.stringify(defaults));
-		} catch (e) {
-			console.warn('Failed to save launch defaults:', e);
-		}
-	}
-	
-	function loadLaunchDefaults(): LaunchDefaults | null {
-		try {
-			const stored = localStorage.getItem(LAUNCH_DEFAULTS_KEY);
-			if (!stored) return null;
-			return JSON.parse(stored) as LaunchDefaults;
-		} catch (e) {
-			console.warn('Failed to load launch defaults:', e);
-			return null;
-		}
-	}
-	
-	function applyLaunchDefaults(availableModels: Array<{id: string}>, maxNodes: number): void {
-		const defaults = loadLaunchDefaults();
-		if (!defaults) return;
-		
-		// Apply sharding and instance type unconditionally
-		selectedSharding = defaults.sharding;
-		selectedInstanceType = defaults.instanceType;
-		
-		// Apply minNodes if valid (between 1 and maxNodes)
-		if (defaults.minNodes && defaults.minNodes >= 1 && defaults.minNodes <= maxNodes) {
-			selectedMinNodes = defaults.minNodes;
-		}
-		
-		// Only apply model if it exists in the available models
-		if (defaults.modelId && availableModels.some(m => m.id === defaults.modelId)) {
-			selectPreviewModel(defaults.modelId);
-		}
-	}
-	
 	let selectedInstanceType = $state<InstanceMeta>('MlxRing');
 	let selectedMinNodes = $state<number>(1);
 	let minNodesInitialized = $state(false);
@@ -351,9 +298,6 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 				const data = await response.json();
 				// API returns { data: [{ id, name }] } format
 				models = data.data || [];
-				// Restore last launch defaults if available
-				const currentNodeCount = topologyData() ? Object.keys(topologyData()!.nodes).length : 1;
-				applyLaunchDefaults(models, currentNodeCount);
 			}
 		} catch (error) {
 			console.error('Failed to fetch models:', error);
@@ -400,8 +344,10 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 				const errorText = await response.text();
 				console.error('Failed to launch instance:', errorText);
 			} else {
-				// Always auto-select the newly launched model so the user chats to what they just launched
-				setSelectedChatModel(modelId);
+				// Auto-select the launched model only if no model is currently selected
+				if (!selectedChatModel()) {
+					setSelectedChatModel(modelId);
+				}
 				
 				// Scroll to the bottom of instances container to show the new instance
 				// Use multiple attempts to ensure DOM has updated with the new instance
@@ -591,7 +537,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 		// Unwrap the instance
 		const [instanceTag, instance] = getTagged(instanceWrapped);
 		if (!instance || typeof instance !== 'object') {
-			return { isDownloading: false, progress: null, statusText: 'PREPARING', perNode: [] };
+			return { isDownloading: false, progress: null, statusText: 'UNKNOWN', perNode: [] };
 		}

 		const inst = instance as { shardAssignments?: { nodeToRunner?: Record<string, string>; runnerToShard?: Record<string, unknown>; modelId?: string } };
@@ -704,7 +650,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 	function deriveInstanceStatus(instanceWrapped: unknown): { statusText: string; statusClass: string } {
 		const [, instance] = getTagged(instanceWrapped);
 		if (!instance || typeof instance !== 'object') {
-			return { statusText: 'PREPARING', statusClass: 'inactive' };
+			return { statusText: 'UNKNOWN', statusClass: 'inactive' };
 		}
 		
 		const inst = instance as { shardAssignments?: { runnerToShard?: Record<string, unknown> } };
@@ -733,7 +679,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {

 		const has = (s: string) => statuses.includes(s);

-		if (statuses.length === 0) return { statusText: 'PREPARING', statusClass: 'inactive' };
+		if (statuses.length === 0) return { statusText: 'UNKNOWN', statusClass: 'inactive' };
 		if (has('Failed')) return { statusText: 'FAILED', statusClass: 'failed' };
 		if (has('Shutdown')) return { statusText: 'SHUTDOWN', statusClass: 'inactive' };
 		if (has('Loading')) return { statusText: 'LOADING', statusClass: 'starting' };
@@ -761,10 +707,6 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 	async function deleteInstance(instanceId: string) {
 		if (!confirm(`Delete instance ${instanceId.slice(0, 8)}...?`)) return;
 		
-		// Get the model ID of the instance being deleted before we delete it
-		const deletedInstanceModelId = getInstanceModelId(instanceData[instanceId]);
-		const wasSelected = selectedChatModel() === deletedInstanceModelId;
-		
 		try {
 			const response = await fetch(`/instance/${instanceId}`, {
 				method: 'DELETE',
@@ -773,24 +715,6 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 			
 			if (!response.ok) {
 				console.error('Failed to delete instance:', response.status);
-			} else if (wasSelected) {
-				// If we deleted the currently selected model, switch to another available model
-				// Find another instance that isn't the one we just deleted
-				const remainingInstances = Object.entries(instanceData).filter(([id]) => id !== instanceId);
-				if (remainingInstances.length > 0) {
-					// Select the last instance (most recently added, since objects preserve insertion order)
-					const [, lastInstance] = remainingInstances[remainingInstances.length - 1];
-					const newModelId = getInstanceModelId(lastInstance);
-					if (newModelId && newModelId !== 'Unknown' && newModelId !== 'Unknown Model') {
-						setSelectedChatModel(newModelId);
-					} else {
-						// Clear selection if no valid model found
-						setSelectedChatModel('');
-					}
-				} else {
-					// No more instances, clear the selection
-					setSelectedChatModel('');
-				}
 			}
 		} catch (error) {
 			console.error('Error deleting instance:', error);
@@ -1064,7 +988,6 @@ function toggleInstanceDownloadDetails(nodeId: string): void {

 	function handleSliderMouseUp() {
 		isDraggingSlider = false;
-		saveLaunchDefaults();
 	}

 	// Handle touch events for mobile
@@ -1084,7 +1007,6 @@ function toggleInstanceDownloadDetails(nodeId: string): void {

 	function handleSliderTouchEnd() {
 		isDraggingSlider = false;
-		saveLaunchDefaults();
 	}

 	const nodeCount = $derived(data ? Object.keys(data.nodes).length : 0);
@@ -1287,9 +1209,9 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 							<div class="flex-1 h-px bg-gradient-to-r from-exo-yellow/30 to-transparent"></div>
 						</div>
 						
-						<div
+						<div 
 							bind:this={instancesContainerRef}
-							class="max-h-72 xl:max-h-96 space-y-3 overflow-y-auto overflow-x-hidden py-px"
+							class="max-h-72 space-y-3 overflow-y-auto"
 						>
 								{#each Object.entries(instanceData) as [id, instance]}
 									{@const downloadInfo = getInstanceDownloadStatus(id, instance)}
@@ -1542,7 +1464,6 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 												onclick={() => {
 													if (modelCanFit) {
 														selectPreviewModel(model.id);
-														saveLaunchDefaults();
 														isModelDropdownOpen = false;
 														modelDropdownSearch = '';
 													}
@@ -1576,7 +1497,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 								<div class="text-xs text-white/70 font-mono mb-2">Sharding:</div>
 								<div class="flex gap-2">
 									<button 
-										onclick={() => { selectedSharding = 'Pipeline'; saveLaunchDefaults(); }}
+										onclick={() => selectedSharding = 'Pipeline'}
 										class="flex items-center gap-2 py-2 px-4 text-sm font-mono border rounded transition-all duration-200 cursor-pointer {selectedSharding === 'Pipeline' ? 'bg-transparent text-exo-yellow border-exo-yellow' : 'bg-transparent text-white/70 border-exo-medium-gray/50 hover:border-exo-yellow/50'}"
 									>
 										<span class="w-4 h-4 rounded-full border-2 flex items-center justify-center {selectedSharding === 'Pipeline' ? 'border-exo-yellow' : 'border-exo-medium-gray'}">
@@ -1587,7 +1508,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 										Pipeline
 									</button>
 									<button 
-										onclick={() => { selectedSharding = 'Tensor'; saveLaunchDefaults(); }}
+										onclick={() => selectedSharding = 'Tensor'}
 										class="flex items-center gap-2 py-2 px-4 text-sm font-mono border rounded transition-all duration-200 cursor-pointer {selectedSharding === 'Tensor' ? 'bg-transparent text-exo-yellow border-exo-yellow' : 'bg-transparent text-white/70 border-exo-medium-gray/50 hover:border-exo-yellow/50'}"
 									>
 										<span class="w-4 h-4 rounded-full border-2 flex items-center justify-center {selectedSharding === 'Tensor' ? 'border-exo-yellow' : 'border-exo-medium-gray'}">
@@ -1605,7 +1526,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 								<div class="text-xs text-white/70 font-mono mb-2">Instance Type:</div>
 								<div class="flex gap-2">
 									<button 
-										onclick={() => { selectedInstanceType = 'MlxRing'; saveLaunchDefaults(); }}
+										onclick={() => selectedInstanceType = 'MlxRing'}
 										class="flex items-center gap-2 py-2 px-4 text-sm font-mono border rounded transition-all duration-200 cursor-pointer {selectedInstanceType === 'MlxRing' ? 'bg-transparent text-exo-yellow border-exo-yellow' : 'bg-transparent text-white/70 border-exo-medium-gray/50 hover:border-exo-yellow/50'}"
 									>
 										<span class="w-4 h-4 rounded-full border-2 flex items-center justify-center {selectedInstanceType === 'MlxRing' ? 'border-exo-yellow' : 'border-exo-medium-gray'}">
@@ -1616,7 +1537,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 										MLX Ring
 									</button>
 									<button 
-										onclick={() => { selectedInstanceType = 'MlxIbv'; saveLaunchDefaults(); }}
+										onclick={() => selectedInstanceType = 'MlxIbv'}
 										class="flex items-center gap-2 py-2 px-4 text-sm font-mono border rounded transition-all duration-200 cursor-pointer {selectedInstanceType === 'MlxIbv' ? 'bg-transparent text-exo-yellow border-exo-yellow' : 'bg-transparent text-white/70 border-exo-medium-gray/50 hover:border-exo-yellow/50'}"
 									>
 										<span class="w-4 h-4 rounded-full border-2 flex items-center justify-center {selectedInstanceType === 'MlxIbv' ? 'border-exo-yellow' : 'border-exo-medium-gray'}">
@@ -1793,7 +1714,7 @@ function toggleInstanceDownloadDetails(nodeId: string): void {
 								<h3 class="text-xs text-exo-yellow font-mono tracking-[0.2em] uppercase">Instances</h3>
 								<div class="flex-1 h-px bg-gradient-to-r from-exo-yellow/30 to-transparent"></div>
 							</div>
-								<div class="space-y-3 max-h-72 xl:max-h-96 overflow-y-auto overflow-x-hidden py-px pr-1">
+								<div class="space-y-3 max-h-72 overflow-y-auto pr-1">
 									{#each Object.entries(instanceData) as [id, instance]}
 										{@const downloadInfo = getInstanceDownloadStatus(id, instance)}
 										{@const statusText = downloadInfo.statusText}
--- a/dashboard/src/routes/downloads/+page.svelte
+++ b/dashboard/src/routes/downloads/+page.svelte
@@ -199,13 +199,7 @@
 					const rawProgress = (downloadPayload as Record<string, unknown>).download_progress
 						?? (downloadPayload as Record<string, unknown>).downloadProgress
 						?? {};
-					// For DownloadCompleted, total_bytes is at top level; for DownloadOngoing, it's inside download_progress
-					const totalBytes = getBytes(
-						(downloadPayload as Record<string, unknown>).total_bytes
-						?? (downloadPayload as Record<string, unknown>).totalBytes
-						?? (rawProgress as Record<string, unknown>).total_bytes
-						?? (rawProgress as Record<string, unknown>).totalBytes
-					);
+					const totalBytes = getBytes((rawProgress as Record<string, unknown>).total_bytes ?? (rawProgress as Record<string, unknown>).totalBytes);
 					const downloadedBytes = getBytes((rawProgress as Record<string, unknown>).downloaded_bytes ?? (rawProgress as Record<string, unknown>).downloadedBytes);
 					const speed = (rawProgress as Record<string, unknown>).speed as number ?? 0;
 					const etaMs = (rawProgress as Record<string, unknown>).eta_ms as number ?? (rawProgress as Record<string, unknown>).etaMs as number ?? 0;
@@ -338,13 +332,8 @@
 								<div class="text-lg font-mono text-white truncate">{node.nodeName}</div>
 								<div class="text-xs text-exo-light-gray font-mono truncate">{node.nodeId}</div>
 							</div>
-							<div class="text-xs font-mono uppercase tracking-wider whitespace-nowrap shrink-0 text-right">
-								<div>
-									<span class="text-green-400">{node.models.filter(m => m.status === 'completed').length}</span><span class="text-exo-yellow"> / {node.models.length} models</span>
-								</div>
-								<div class="text-exo-light-gray normal-case tracking-normal">
-									{formatBytes(node.models.filter(m => m.status === 'completed').reduce((sum, m) => sum + m.totalBytes, 0))} on disk
-								</div>
+							<div class="text-xs font-mono uppercase tracking-wider whitespace-nowrap shrink-0">
+								<span class="text-green-400">{node.models.filter(m => m.status === 'completed').length}</span><span class="text-exo-yellow"> /{node.models.length} models</span>
 							</div>
 						</div>

@@ -396,7 +385,7 @@
 								</div>

 								<div class="flex items-center justify-between text-xs font-mono text-exo-light-gray">
-									<span>{model.status === 'completed' ? `Completed (${formatBytes(model.totalBytes)})` : `${formatSpeed(model.speed)} • ETA ${formatEta(model.etaMs)}`}</span>
+									<span>{model.status === 'completed' ? 'Completed' : `${formatSpeed(model.speed)} • ETA ${formatEta(model.etaMs)}`}</span>
 									{#if model.status !== 'completed'}
 										<span>{model.files.length} file{model.files.length === 1 ? '' : 's'}</span>
 									{/if}
--- a/dashboard/vite.config.ts
+++ b/dashboard/vite.config.ts
@@ -1,15 +1,16 @@
-import tailwindcss from "@tailwindcss/vite";
-import { sveltekit } from "@sveltejs/kit/vite";
-import { defineConfig } from "vite";
+import tailwindcss from '@tailwindcss/vite';
+import { sveltekit } from '@sveltejs/kit/vite';
+import { defineConfig } from 'vite';

 export default defineConfig({
 	plugins: [tailwindcss(), sveltekit()],
 	server: {
 		proxy: {
-			"/v1": "http://localhost:52415",
-			"/state": "http://localhost:52415",
-			"/models": "http://localhost:52415",
-			"/instance": "http://localhost:52415",
-		},
-	},
+			'/v1': 'http://localhost:52415',
+			'/state': 'http://localhost:52415',
+			'/models': 'http://localhost:52415',
+			'/instance': 'http://localhost:52415'
+		}
+	}
 });
+
--- a/docs/api.md
+++ b/docs/api.md
@@ -1,212 +0,0 @@
-# EXO API – Technical Reference
-
-This document describes the REST API exposed by the **EXO ** service, as implemented in:
-
-`src/exo/master/api.py`
-
-The API is used to manage model instances in the cluster, inspect cluster state, and perform inference using an OpenAI-compatible interface.
-
-Base URL example:
-
-```
-http://localhost:52415
-```
-
-## 1. General / Meta Endpoints
-
-### Get Master Node ID
-
-**GET** `/node_id`
-
-Returns the identifier of the current master node.
-
-**Response (example):**
-
-```json
-{
-  "node_id": "node-1234"
-}
-```
-
-### Get Cluster State
-
-**GET** `/state`
-
-Returns the current state of the cluster, including nodes and active instances.
-
-**Response:**
-JSON object describing topology, nodes, and instances.
-
-### Get Events
-
-**GET** `/events`
-
-Returns the list of internal events recorded by the master (mainly for debugging and observability).
-
-**Response:**
-Array of event objects.
-
-## 2. Model Instance Management
-
-### Create Instance
-
-**POST** `/instance`
-
-Creates a new model instance in the cluster.
-
-**Request body (example):**
-
-```json
-{
-  "instance": {
-    "model_id": "llama-3.2-1b",
-    "placement": { }
-  }
-}
-```
-
-**Response:**
-JSON description of the created instance.
-
-### Delete Instance
-
-**DELETE** `/instance/{instance_id}`
-
-Deletes an existing instance by ID.
-
-**Path parameters:**
-
-* `instance_id`: string, ID of the instance to delete
-
-**Response:**
-Status / confirmation JSON.
-
-### Get Instance
-
-**GET** `/instance/{instance_id}`
-
-Returns details of a specific instance.
-
-**Path parameters:**
-
-* `instance_id`: string
-
-**Response:**
-JSON description of the instance.
-
-### Preview Placements
-
-**GET** `/instance/previews?model_id=...`
-
-Returns possible placement previews for a given model.
-
-**Query parameters:**
-
-* `model_id`: string, required
-
-**Response:**
-Array of placement preview objects.
-
-### Compute Placement
-
-**GET** `/instance/placement`
-
-Computes a placement for a potential instance without creating it.
-
-**Query parameters (typical):**
-
-* `model_id`: string
-* `sharding`: string or config
-* `instance_meta`: JSON-encoded metadata
-* `min_nodes`: integer
-
-**Response:**
-JSON object describing the proposed placement / instance configuration.
-
-### Place Instance (Dry Operation)
-
-**POST** `/place_instance`
-
-Performs a placement operation for an instance (planning step), without necessarily creating it.
-
-**Request body:**
-JSON describing the instance to be placed.
-
-**Response:**
-Placement result.
-
-## 3. Models
-
-### List Models
-
-**GET** `/models`
-**GET** `/v1/models` (alias)
-
-Returns the list of available models and their metadata.
-
-**Response:**
-Array of model descriptors.
-
-## 4. Inference / Chat Completions
-
-### OpenAI-Compatible Chat Completions
-
-**POST** `/v1/chat/completions`
-
-Executes a chat completion request using an OpenAI-compatible schema. Supports streaming and non-streaming modes.
-
-**Request body (example):**
-
-```json
-{
-  "model": "llama-3.2-1b",
-  "messages": [
-    { "role": "system", "content": "You are a helpful assistant." },
-    { "role": "user", "content": "Hello" }
-  ],
-  "stream": false
-}
-```
-
-**Response:**
-OpenAI-compatible chat completion response.
-
-### Benchmarked Chat Completions
-
-**POST** `/bench/chat/completions`
-
-Same as `/v1/chat/completions`, but also returns performance and generation statistics.
-
-**Request body:**
-Same schema as `/v1/chat/completions`.
-
-**Response:**
-Chat completion plus benchmarking metrics.
-
-## 5. Complete Endpoint Summary
-
-```
-GET     /node_id
-GET     /state
-GET     /events
-
-POST    /instance
-GET     /instance/{instance_id}
-DELETE  /instance/{instance_id}
-
-GET     /instance/previews
-GET     /instance/placement
-POST    /place_instance
-
-GET     /models
-GET     /v1/models
-
-POST    /v1/chat/completions
-POST    /bench/chat/completions
-```
-
-## 6. Notes
-
-* The `/v1/chat/completions` endpoint is compatible with the OpenAI API format, so existing OpenAI clients can be pointed to EXO by changing the base URL.
-* The instance placement endpoints allow you to plan and preview cluster allocations before actually creating instances.
-* The `/events` and `/state` endpoints are primarily intended for operational visibility and debugging.
--- a/flake.lock
+++ b/flake.lock
@@ -1,42 +1,5 @@
 {
  "nodes": {
-    "crane": {
-      "locked": {
-        "lastModified": 1767744144,
-        "narHash": "sha256-9/9ntI0D+HbN4G0TrK3KmHbTvwgswz7p8IEJsWyef8Q=",
-        "owner": "ipetkov",
-        "repo": "crane",
-        "rev": "2fb033290bf6b23f226d4c8b32f7f7a16b043d7e",
-        "type": "github"
-      },
-      "original": {
-        "owner": "ipetkov",
-        "repo": "crane",
-        "type": "github"
-      }
-    },
-    "dream2nix": {
-      "inputs": {
-        "nixpkgs": [
-          "nixpkgs"
-        ],
-        "purescript-overlay": "purescript-overlay",
-        "pyproject-nix": "pyproject-nix"
-      },
-      "locked": {
-        "lastModified": 1765953015,
-        "narHash": "sha256-5FBZbbWR1Csp3Y2icfRkxMJw/a/5FGg8hCXej2//bbI=",
-        "owner": "nix-community",
-        "repo": "dream2nix",
-        "rev": "69eb01fa0995e1e90add49d8ca5bcba213b0416f",
-        "type": "github"
-      },
-      "original": {
-        "owner": "nix-community",
-        "repo": "dream2nix",
-        "type": "github"
-      }
-    },
    "fenix": {
      "inputs": {
        "nixpkgs": [
@@ -45,11 +8,11 @@
        "rust-analyzer-src": "rust-analyzer-src"
      },
      "locked": {
-        "lastModified": 1768287139,
-        "narHash": "sha256-nsXFt0OzUi6K7dUzzJD5/v9e0Ic+fvclfIW936/43ZM=",
+        "lastModified": 1761893049,
+        "narHash": "sha256-1TtFDPhC+ZsrOOtBnry1EZC+WipTTvsOVjIEVugqji8=",
        "owner": "nix-community",
        "repo": "fenix",
-        "rev": "a4a3aa956931f90f35453cb519e4545e9ad7f773",
+        "rev": "c2ac9a5c0d6d16630c3b225b874bd14528d1abe6",
        "type": "github"
      },
      "original": {
@@ -58,59 +21,25 @@
        "type": "github"
      }
    },
-    "flake-compat": {
-      "flake": false,
-      "locked": {
-        "lastModified": 1696426674,
-        "narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
-        "owner": "edolstra",
-        "repo": "flake-compat",
-        "rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
-        "type": "github"
-      },
-      "original": {
-        "owner": "edolstra",
-        "repo": "flake-compat",
-        "type": "github"
-      }
-    },
-    "flake-parts": {
+    "flake-utils": {
      "inputs": {
-        "nixpkgs-lib": [
-          "nixpkgs"
-        ]
+        "systems": "systems"
      },
      "locked": {
-        "lastModified": 1768135262,
-        "narHash": "sha256-PVvu7OqHBGWN16zSi6tEmPwwHQ4rLPU9Plvs8/1TUBY=",
-        "owner": "hercules-ci",
-        "repo": "flake-parts",
-        "rev": "80daad04eddbbf5a4d883996a73f3f542fa437ac",
+        "lastModified": 1731533236,
+        "narHash": "sha256-l0KFg5HjrsfsO/JpG+r7fRrqm12kzFHyUHqHCVpMMbI=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "11707dc2f618dd54ca8739b309ec4fc024de578b",
        "type": "github"
      },
      "original": {
-        "owner": "hercules-ci",
-        "repo": "flake-parts",
+        "owner": "numtide",
+        "repo": "flake-utils",
        "type": "github"
      }
    },
    "nixpkgs": {
-      "locked": {
-        "lastModified": 1768127708,
-        "narHash": "sha256-1Sm77VfZh3mU0F5OqKABNLWxOuDeHIlcFjsXeeiPazs=",
-        "owner": "NixOS",
-        "repo": "nixpkgs",
-        "rev": "ffbc9f8cbaacfb331b6017d5a5abb21a492c9a38",
-        "type": "github"
-      },
-      "original": {
-        "owner": "NixOS",
-        "ref": "nixos-unstable",
-        "repo": "nixpkgs",
-        "type": "github"
-      }
-    },
-    "nixpkgs-swift": {
      "locked": {
        "lastModified": 1761672384,
        "narHash": "sha256-o9KF3DJL7g7iYMZq9SWgfS1BFlNbsm6xplRjVlOCkXI=",
@@ -121,74 +50,27 @@
      },
      "original": {
        "owner": "NixOS",
+        "ref": "nixos-unstable",
        "repo": "nixpkgs",
-        "rev": "08dacfca559e1d7da38f3cf05f1f45ee9bfd213c",
-        "type": "github"
-      }
-    },
-    "purescript-overlay": {
-      "inputs": {
-        "flake-compat": "flake-compat",
-        "nixpkgs": [
-          "dream2nix",
-          "nixpkgs"
-        ],
-        "slimlock": "slimlock"
-      },
-      "locked": {
-        "lastModified": 1728546539,
-        "narHash": "sha256-Sws7w0tlnjD+Bjck1nv29NjC5DbL6nH5auL9Ex9Iz2A=",
-        "owner": "thomashoneyman",
-        "repo": "purescript-overlay",
-        "rev": "4ad4c15d07bd899d7346b331f377606631eb0ee4",
-        "type": "github"
-      },
-      "original": {
-        "owner": "thomashoneyman",
-        "repo": "purescript-overlay",
-        "type": "github"
-      }
-    },
-    "pyproject-nix": {
-      "inputs": {
-        "nixpkgs": [
-          "dream2nix",
-          "nixpkgs"
-        ]
-      },
-      "locked": {
-        "lastModified": 1763017646,
-        "narHash": "sha256-Z+R2lveIp6Skn1VPH3taQIuMhABg1IizJd8oVdmdHsQ=",
-        "owner": "pyproject-nix",
-        "repo": "pyproject.nix",
-        "rev": "47bd6f296502842643078d66128f7b5e5370790c",
-        "type": "github"
-      },
-      "original": {
-        "owner": "pyproject-nix",
-        "repo": "pyproject.nix",
        "type": "github"
      }
    },
    "root": {
      "inputs": {
-        "crane": "crane",
-        "dream2nix": "dream2nix",
        "fenix": "fenix",
-        "flake-parts": "flake-parts",
+        "flake-utils": "flake-utils",
        "nixpkgs": "nixpkgs",
-        "nixpkgs-swift": "nixpkgs-swift",
        "treefmt-nix": "treefmt-nix"
      }
    },
    "rust-analyzer-src": {
      "flake": false,
      "locked": {
-        "lastModified": 1768224240,
-        "narHash": "sha256-Pp1dDrXKPBUJReZnnDElFyHYn67XTd48zRhToheLjtk=",
+        "lastModified": 1761849405,
+        "narHash": "sha256-igXdvC+WCUN+3gnfk+ptT7rMmxQuY6WbIg1rXMUN1DM=",
        "owner": "rust-lang",
        "repo": "rust-analyzer",
-        "rev": "725349602e525df37f377701e001fe8aab807878",
+        "rev": "f7de8ae045a5fe80f1203c5a1c3015b05f7c3550",
        "type": "github"
      },
      "original": {
@@ -198,25 +80,18 @@
        "type": "github"
      }
    },
-    "slimlock": {
-      "inputs": {
-        "nixpkgs": [
-          "dream2nix",
-          "purescript-overlay",
-          "nixpkgs"
-        ]
-      },
+    "systems": {
      "locked": {
-        "lastModified": 1688756706,
-        "narHash": "sha256-xzkkMv3neJJJ89zo3o2ojp7nFeaZc2G0fYwNXNJRFlo=",
-        "owner": "thomashoneyman",
-        "repo": "slimlock",
-        "rev": "cf72723f59e2340d24881fd7bf61cb113b4c407c",
+        "lastModified": 1681028828,
+        "narHash": "sha256-Vy1rq5AaRuLzOxct8nz4T6wlgyUR7zLU309k9mBC768=",
+        "owner": "nix-systems",
+        "repo": "default",
+        "rev": "da67096a3b9bf56a91d16901293e51ba5b49a27e",
        "type": "github"
      },
      "original": {
-        "owner": "thomashoneyman",
-        "repo": "slimlock",
+        "owner": "nix-systems",
+        "repo": "default",
        "type": "github"
      }
    },
@@ -227,11 +102,11 @@
        ]
      },
      "locked": {
-        "lastModified": 1768158989,
-        "narHash": "sha256-67vyT1+xClLldnumAzCTBvU0jLZ1YBcf4vANRWP3+Ak=",
+        "lastModified": 1762938485,
+        "narHash": "sha256-AlEObg0syDl+Spi4LsZIBrjw+snSVU4T8MOeuZJUJjM=",
        "owner": "numtide",
        "repo": "treefmt-nix",
-        "rev": "e96d59dff5c0d7fddb9d113ba108f03c3ef99eca",
+        "rev": "5b4ee75aeefd1e2d5a1cc43cf6ba65eba75e83e4",
        "type": "github"
      },
      "original": {
--- a/flake.nix
+++ b/flake.nix
@@ -3,134 +3,118 @@

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
-
-    flake-parts = {
-      url = "github:hercules-ci/flake-parts";
-      inputs.nixpkgs-lib.follows = "nixpkgs";
-    };
-
-    crane.url = "github:ipetkov/crane";
-
+    flake-utils.url = "github:numtide/flake-utils";
+    # Provides Rust dev-env integration:
    fenix = {
      url = "github:nix-community/fenix";
      inputs.nixpkgs.follows = "nixpkgs";
    };
-
+    # Provides formatting infrastructure:
    treefmt-nix = {
      url = "github:numtide/treefmt-nix";
      inputs.nixpkgs.follows = "nixpkgs";
    };
-
-    dream2nix = {
-      url = "github:nix-community/dream2nix";
-      inputs.nixpkgs.follows = "nixpkgs";
-    };
-
-    # Pinned nixpkgs for swift-format (swift is broken on x86_64-linux in newer nixpkgs)
-    nixpkgs-swift.url = "github:NixOS/nixpkgs/08dacfca559e1d7da38f3cf05f1f45ee9bfd213c";
  };

-  nixConfig = {
-    extra-trusted-public-keys = "exo.cachix.org-1:okq7hl624TBeAR3kV+g39dUFSiaZgLRkLsFBCuJ2NZI=";
-    extra-substituters = "https://exo.cachix.org";
-  };
+  # TODO: figure out caching story
+  # nixConfig = {
+  #   # nix community cachix
+  #   extra-trusted-public-keys = "nix-community.cachix.org-1:mB9FSh9qf2dCimDSUo8Zy7bkq5CX+/rkCWyvRCYg3Fs=";
+  #   extra-substituters = "https://nix-community.cachix.org";
+  # };

  outputs =
    inputs:
-    inputs.flake-parts.lib.mkFlake { inherit inputs; } {
+    let
      systems = [
        "x86_64-linux"
        "aarch64-darwin"
        "aarch64-linux"
      ];
+      fenixToolchain = system: inputs.fenix.packages.${system}.complete;
+    in
+    inputs.flake-utils.lib.eachSystem systems (
+      system:
+      let
+        pkgs = import inputs.nixpkgs {
+          inherit system;
+          overlays = [ inputs.fenix.overlays.default ];
+        };
+        treefmtEval = inputs.treefmt-nix.lib.evalModule pkgs {
+          projectRootFile = "flake.nix";
+          programs.ruff-format.enable = true;
+          programs.ruff-format.excludes = [ "rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi" ];
+          programs.rustfmt.enable = true;
+          programs.rustfmt.package = (fenixToolchain system).rustfmt;
+          programs.nixpkgs-fmt.enable = true;
+        };
+      in
+      {
+        formatter = treefmtEval.config.build.wrapper;
+        checks.formatting = treefmtEval.config.build.check inputs.self;
+        checks.lint = pkgs.runCommand "lint-check" { } ''
+          export RUFF_CACHE_DIR="$TMPDIR/ruff-cache"
+          ${pkgs.ruff}/bin/ruff check ${inputs.self}/
+          touch $out
+        '';

-      imports = [
-        inputs.treefmt-nix.flakeModule
-        ./dashboard/parts.nix
-        ./rust/parts.nix
-      ];
+        devShells.default = pkgs.mkShell {
+          packages =
+            with pkgs;
+            [
+              # PYTHON
+              python313
+              uv
+              ruff
+              basedpyright

-      perSystem =
-        { config, self', inputs', pkgs, lib, system, ... }:
-        let
-          fenixToolchain = inputs'.fenix.packages.complete;
-          # Use pinned nixpkgs for swift-format (swift is broken on x86_64-linux in newer nixpkgs)
-          pkgsSwift = import inputs.nixpkgs-swift { inherit system; };
-        in
-        {
-          treefmt = {
-            projectRootFile = "flake.nix";
-            programs = {
-              nixpkgs-fmt.enable = true;
-              ruff-format = {
-                enable = true;
-                excludes = [ "rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi" ];
-              };
-              rustfmt = {
-                enable = true;
-                package = config.rust.toolchain;
-              };
-              prettier = {
-                enable = true;
-                includes = [ "*.ts" ];
-              };
-              swift-format = {
-                enable = true;
-                package = pkgsSwift.swiftPackages.swift-format;
-              };
-            };
-          };
+              # RUST
+              ((fenixToolchain system).withComponents [
+                "cargo"
+                "rustc"
+                "clippy"
+                "rustfmt"
+                "rust-src"
+              ])
+              rustup # Just here to make RustRover happy

-          checks.lint = pkgs.runCommand "lint-check" { } ''
-            export RUFF_CACHE_DIR="$TMPDIR/ruff-cache"
-            ${pkgs.ruff}/bin/ruff check ${inputs.self}/
-            touch $out
+              # NIX
+              nixpkgs-fmt
+
+              # SVELTE
+              nodejs
+
+              # MISC
+              just
+              jq
+            ]
+            ++ (pkgs.lib.optionals pkgs.stdenv.isLinux [
+              # IFCONFIG
+              unixtools.ifconfig
+
+              # Build dependencies for Linux
+              pkg-config
+              openssl
+            ])
+            ++ (pkgs.lib.optionals pkgs.stdenv.isDarwin [
+              # MACMON
+              macmon
+            ]);
+
+          shellHook = ''
+            # PYTHON
+            export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:${pkgs.python313}/lib"
+            ${pkgs.lib.optionalString pkgs.stdenv.isLinux ''
+              # Build environment for Linux
+              export PKG_CONFIG_PATH="${pkgs.openssl.dev}/lib/pkgconfig:$PKG_CONFIG_PATH"
+              export LD_LIBRARY_PATH="${pkgs.openssl.out}/lib:$LD_LIBRARY_PATH"
+            ''}
+            echo
+            echo "🍎🍎 Run 'just <recipe>' to get started"
+            just --list
          '';

-          devShells.default = with pkgs; pkgs.mkShell {
-            inputsFrom = [ self'.checks.cargo-build ];
-
-            packages =
-              [
-                # FORMATTING
-                config.treefmt.build.wrapper
-
-                # PYTHON
-                python313
-                uv
-                ruff
-                basedpyright
-
-                # RUST
-                config.rust.toolchain
-                maturin
-
-                # NIX
-                nixpkgs-fmt
-
-                # SVELTE
-                nodejs
-
-                # MISC
-                just
-                jq
-              ]
-              ++ lib.optionals stdenv.isLinux [
-                unixtools.ifconfig
-              ]
-              ++ lib.optionals stdenv.isDarwin [
-                macmon
-              ];
-
-            OPENSSL_NO_VENDOR = "1";
-
-            shellHook = ''
-              export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:${python313}/lib"
-              ${lib.optionalString stdenv.isLinux ''
-                export LD_LIBRARY_PATH="${openssl.out}/lib:$LD_LIBRARY_PATH"
-              ''}
-            '';
-          };
        };
-    };
+      }
+    );
 }
--- a/2
+++ b/2
@@ -1,5 +1,3 @@
-export NIX_CONFIG := "extra-experimental-features = nix-command flakes"
-
 fmt:
    nix fmt

--- a/pyproject.toml
+++ b/pyproject.toml
@@ -8,22 +8,33 @@ dependencies = [
    "aiofiles>=24.1.0",
    "aiohttp>=3.12.14",
    "types-aiofiles>=24.1.0.20250708",
+    "typeguard>=4.4.4",
    "pydantic>=2.11.7",
+    "base58>=2.1.1",
+    "cryptography>=45.0.5",
    "fastapi>=0.116.1",
    "filelock>=3.18.0",
+    "aiosqlite>=0.21.0",
+    "networkx>=3.5",
+    "protobuf>=6.32.0",
+    "rich>=14.1.0",
    "rustworkx>=0.17.1",
+    "sqlmodel>=0.0.24",
+    "sqlalchemy[asyncio]>=2.0.43",
+    "greenlet>=3.2.4",
    "huggingface-hub>=0.33.4",
    "psutil>=7.0.0",
    "loguru>=0.7.3",
+    "textual>=5.3.0",
    "exo_pyo3_bindings", # rust bindings
    "anyio==4.11.0",
-    "mlx==0.30.1; sys_platform == 'darwin'",
-    "mlx[cpu]==0.30.1; sys_platform == 'linux'",
-    "mlx-lm @ git+https://github.com/AlexCheema/mlx-lm.git@fix-transformers-5.0.0rc2",
+    "bidict>=0.23.1",
+    "mlx>=0.30.1; sys_platform == 'darwin'",
+    "mlx[cpu]>=0.30.1; sys_platform == 'linux'",
+    "mlx-lm>=0.28.3",
    "tiktoken>=0.12.0", # required for kimi k2 tokenizer
    "hypercorn>=0.18.0",
    "openai-harmony>=0.0.8",
-    "httpx>=0.28.1",
 ]

 [project.scripts]
@@ -34,7 +45,6 @@ exo = "exo.main:main"
 # dependencies only required for development
 [dependency-groups]
 dev = [
-    "basedpyright>=1.29.0",
    "pyinstaller>=6.17.0",
    "pytest>=8.4.0",
    "pytest-asyncio>=1.0.0",
@@ -72,7 +82,7 @@ build-backend = "uv_build"
 ###

 [tool.basedpyright]
-include = [".venv/lib/mlx", ".venv/lib/mlx_lm", "src", "bench"]
+include = [".venv/lib/mlx", ".venv/lib/mlx_lm", "src"]
 typeCheckingMode = "strict"
 failOnWarnings = true

@@ -100,7 +110,6 @@ root = "src"

 # supported platforms for this project
 [tool.uv]
-prerelease = "allow"
 environments = [
    "sys_platform == 'darwin'",
    "sys_platform == 'linux'",
--- a/rust/parts.nix
+++ b/rust/parts.nix
@@ -1,145 +0,0 @@
-{ inputs, ... }:
-{
-  perSystem =
-    { config, self', inputs', pkgs, lib, ... }:
-    let
-      # Fenix nightly toolchain with all components
-      fenixPkgs = inputs'.fenix.packages;
-      rustToolchain = fenixPkgs.complete.withComponents [
-        "cargo"
-        "rustc"
-        "clippy"
-        "rustfmt"
-        "rust-src"
-        "rust-analyzer"
-      ];
-
-      # Crane with fenix toolchain
-      craneLib = (inputs.crane.mkLib pkgs).overrideToolchain rustToolchain;
-
-      # Source filtering - only include rust/ directory and root Cargo files
-      # This ensures changes to Python/docs/etc don't trigger Rust rebuilds
-      src = lib.cleanSourceWith {
-        src = inputs.self;
-        filter =
-          path: type:
-          let
-            baseName = builtins.baseNameOf path;
-            parentDir = builtins.dirOf path;
-            inRustDir =
-              (lib.hasInfix "/rust/" path)
-              || (lib.hasSuffix "/rust" parentDir)
-              || (baseName == "rust" && type == "directory");
-            isRootCargoFile =
-              (baseName == "Cargo.toml" || baseName == "Cargo.lock")
-              && (builtins.dirOf path == toString inputs.self);
-          in
-          isRootCargoFile
-          || (inRustDir && (craneLib.filterCargoSources path type || lib.hasSuffix ".toml" path || lib.hasSuffix ".md" path));
-      };
-
-      # Common arguments for all Rust builds
-      commonArgs = {
-        inherit src;
-        pname = "exo-rust";
-        version = "0.0.1";
-        strictDeps = true;
-
-        nativeBuildInputs = [
-          pkgs.pkg-config
-          pkgs.python313 # Required for pyo3-build-config
-        ];
-
-        buildInputs = [
-          pkgs.openssl
-          pkgs.python313 # Required for pyo3 tests
-        ];
-
-        OPENSSL_NO_VENDOR = "1";
-
-        # Required for pyo3 tests to find libpython
-        LD_LIBRARY_PATH = lib.makeLibraryPath [ pkgs.python313 ];
-      };
-
-      # Build dependencies once for caching
-      cargoArtifacts = craneLib.buildDepsOnly (
-        commonArgs
-        // {
-          cargoExtraArgs = "--workspace";
-        }
-      );
-    in
-    {
-      # Export toolchain for use in treefmt and devShell
-      options.rust = {
-        toolchain = lib.mkOption {
-          type = lib.types.package;
-          default = rustToolchain;
-          description = "The Rust toolchain to use";
-        };
-      };
-
-      config = {
-        packages = {
-          # Python bindings wheel via maturin
-          exo_pyo3_bindings = craneLib.buildPackage (
-            commonArgs
-            // {
-              inherit cargoArtifacts;
-              pname = "exo_pyo3_bindings";
-
-              nativeBuildInputs = commonArgs.nativeBuildInputs ++ [
-                pkgs.maturin
-              ];
-
-              buildPhaseCargoCommand = ''
-                maturin build \
-                  --release \
-                  --manylinux off \
-                  --manifest-path rust/exo_pyo3_bindings/Cargo.toml \
-                  --features "pyo3/extension-module,pyo3/experimental-async" \
-                  --interpreter ${pkgs.python313}/bin/python \
-                  --out dist
-              '';
-
-              # Don't use crane's default install behavior
-              doNotPostBuildInstallCargoBinaries = true;
-
-              installPhaseCommand = ''
-                mkdir -p $out
-                cp dist/*.whl $out/
-              '';
-            }
-          );
-        };
-
-        checks = {
-          # Full workspace build (all crates)
-          cargo-build = craneLib.buildPackage (
-            commonArgs
-            // {
-              inherit cargoArtifacts;
-              cargoExtraArgs = "--workspace";
-            }
-          );
-          # Run tests with nextest
-          cargo-nextest = craneLib.cargoNextest (
-            commonArgs
-            // {
-              inherit cargoArtifacts;
-              cargoExtraArgs = "--workspace";
-            }
-          );
-
-          # Build documentation
-          cargo-doc = craneLib.cargoDoc (
-            commonArgs
-            // {
-              inherit cargoArtifacts;
-              cargoExtraArgs = "--workspace";
-            }
-          );
-        };
-      };
-    };
-}
--- a/rust/system_custodian/Cargo.toml
+++ b/rust/system_custodian/Cargo.toml
@@ -0,0 +1,47 @@
+[package]
+name = "system_custodian"
+version = { workspace = true }
+edition = { workspace = true }
+publish = false
+
+[lib]
+doctest = false
+name = "system_custodian"
+path = "src/lib.rs"
+
+[[bin]]
+path = "src/bin/main.rs"
+name = "system_custodian"
+doc = false
+
+[lints]
+workspace = true
+
+[dependencies]
+# datastructures
+either = { workspace = true }
+
+# macro dependencies
+extend = { workspace = true }
+delegate = { workspace = true }
+impl-trait-for-tuples = { workspace = true }
+derive_more = { workspace = true }
+
+# async
+tokio = { workspace = true, features = ["full"] }
+futures = { workspace = true }
+futures-timer = { workspace = true }
+
+# utility dependencies
+util = { workspace = true }
+thiserror = { workspace = true }
+#internment = { workspace = true }
+#recursion = { workspace = true }
+#generativity = { workspace = true }
+#itertools = { workspace = true }
+tracing-subscriber = { version = "0.3.19", features = ["default", "env-filter"] }
+keccak-const = { workspace = true }
+
+# tracing/logging
+log = { workspace = true }
+
--- a/rust/system_custodian/src/bin/main.rs
+++ b/rust/system_custodian/src/bin/main.rs
@@ -0,0 +1,4 @@
+//! TODO: documentation
+//!
+
+fn main() {}
--- a/rust/system_custodian/src/lib.rs
+++ b/rust/system_custodian/src/lib.rs
@@ -0,0 +1,69 @@
+//! This crate defines the logic of, and ways to interact with, Exo's **_System Custodian_** daemon.
+//!
+//! The **_System Custodian_** daemon is supposed to be a long-lived process that precedes the
+//! launch of the Exo application, and is responsible for ensuring the system (configuration, settings,
+//! etc.) is in an appropriate state to facilitate running the Exo application.
+//! The **_System Custodian_** daemon shall expose a [D-Bus](https://www.freedesktop.org/wiki/Software/dbus/)
+//! service which the Exo application uses to _control & query_ it.
+//!
+//! # Lifecycle
+//! When the Exo application starts, it will _wake_ the **_System Custodian_** daemon for the
+//! duration of its lifetime, and after it has terminated the daemon will go back to sleep. When
+//! the daemon wakes up, it will configure the system into a state suitable for the Exo application.
+//! When the daemon goes to sleep, it will revert those changes as much as it can in case they were
+//! destructive to the user's pre-existing configurations.
+//!
+//! # Responsibilities
+//! TODO: these are currently macOS-only; generalize them to other platforms
+//! The **_System Custodian_** daemon is responsible for using the System Configuration framework to
+//!  1. duplicate the current network set
+//!  2. modify existing services to turn on IPv6 if it is not already enabled
+//!  3. remove any bridge services & add any missing services that AREN'T bridge
+//! TODO: In the future:
+//!  1. run a dummy AWDL service to [allow for macOS peer-to-peer wireless networking](https://yggdrasil-network.github.io/2019/08/19/awdl.html)
+//!  2. toggle some GPU/memory configurations to speed up GPU (ask Alex what those configurations are)
+//!  3. if we ever decide to provide our **own network interfaces** that abstract over some userland
+//!     logic, this would be the place to spin that up.
+//!
+//! Then it will watch the SCDynamicStore for:
+//!  1. all __actual__ network interfaces -> collect information on them e.g. their BSD name, MAC
+//!     address, MTU, IPv6 addresses, etc. -> and set up watchers/notifiers to inform the D-Bus
+//!     interface of any changes
+//!  2. watch for any __undesirable__ changes to configuration and revert them
+//!
+//! It should somehow (probably through system sockets and/or BSD interface) trigger IPv6 NDP on
+//! each of the interfaces & also listen to/query for any changes on the OS routing cache??
+//! Basically emulate the `ping6 ff02::1%enX` and `ndp -an` commands BUT BETTER!!!
+//!  1. all that info should coalesce back into the overall collected state -> should be queryable
+//!     over D-Bus
+//! TODO:
+//!  1. we might potentially add to this step a handshake of some kind...? To ensure that we can
+//!     ACTUALLY communicate with that machine over that link over e.g. TCP, UDP, etc. Will the
+//!     handshake require to know Node ID? Will the handshake require heartbeats? Who knows...
+//!  2. if we ever decide to write proprietary L2/L3 protocols for quicker communication,
+//!     e.g. [AF_NDRV](https://www.zerotier.com/blog/how-zerotier-eliminated-kernel-extensions-on-macos/)
+//!     for raw ethernet frame communication, or even a [custom thunderbolt PCIe driver](https://developer.apple.com/documentation/pcidriverkit/creating-custom-pcie-drivers-for-thunderbolt-devices),
+//!     then this would be the place to carry out discovery and proper handshakes with devices
+//!     on the other end of the link.
+//!
+
+// enable Rust-unstable features for convenience
+#![feature(trait_alias)]
+#![feature(stmt_expr_attributes)]
+#![feature(type_alias_impl_trait)]
+#![feature(specialization)]
+#![feature(unboxed_closures)]
+#![feature(const_trait_impl)]
+#![feature(fn_traits)]
+
+pub(crate) mod private {
+    // sealed traits support
+    pub trait Sealed {}
+    impl<T: ?Sized> Sealed for T {}
+}
+
+/// Namespace for all the type/trait aliases used by this crate.
+pub(crate) mod alias {}
+
+/// Namespace for crate-wide extension traits/methods
+pub(crate) mod ext {}
--- a/src/exo/main.py
+++ b/src/exo/main.py
@@ -1,7 +1,6 @@
 import argparse
 import multiprocessing as mp
 import os
-import resource
 import signal
 from dataclasses import dataclass, field
 from typing import Self
@@ -29,7 +28,7 @@ from exo.worker.main import Worker
 @dataclass
 class Node:
    router: Router
-    worker: Worker | None
+    worker: Worker
    election: Election  # Every node participates in election, as we do want a node to become master even if it isn't a master candidate if no master candidates are present.
    election_result_receiver: Receiver[ElectionResult]
    master: Master | None
@@ -63,19 +62,15 @@ class Node:
        else:
            api = None

-        if not args.no_worker:
-            worker = Worker(
-                node_id,
-                session_id,
-                exo_shard_downloader(),
-                connection_message_receiver=router.receiver(topics.CONNECTION_MESSAGES),
-                global_event_receiver=router.receiver(topics.GLOBAL_EVENTS),
-                local_event_sender=router.sender(topics.LOCAL_EVENTS),
-                command_sender=router.sender(topics.COMMANDS),
-            )
-        else:
-            worker = None
-
+        worker = Worker(
+            node_id,
+            session_id,
+            exo_shard_downloader(),
+            connection_message_receiver=router.receiver(topics.CONNECTION_MESSAGES),
+            global_event_receiver=router.receiver(topics.GLOBAL_EVENTS),
+            local_event_sender=router.sender(topics.LOCAL_EVENTS),
+            command_sender=router.sender(topics.COMMANDS),
+        )
        # We start every node with a master
        master = Master(
            node_id,
@@ -105,9 +100,8 @@ class Node:
        async with self._tg as tg:
            signal.signal(signal.SIGINT, lambda _, __: self.shutdown())
            tg.start_soon(self.router.run)
+            tg.start_soon(self.worker.run)
            tg.start_soon(self.election.run)
-            if self.worker:
-                tg.start_soon(self.worker.run)
            if self.master:
                tg.start_soon(self.master.run)
            if self.api:
@@ -196,8 +190,6 @@ class Node:

 def main():
    args = Args.parse()
-    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
-    resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, 65535), hard))

    mp.set_start_method("spawn")
    # TODO: Refactor the current verbosity system
@@ -205,14 +197,6 @@ def main():
    logger.info("Starting EXO")
    logger.info(f"EXO_LIBP2P_NAMESPACE: {os.getenv('EXO_LIBP2P_NAMESPACE')}")

-    # Set FAST_SYNCH override env var for runner subprocesses
-    if args.fast_synch is True:
-        os.environ["EXO_FAST_SYNCH"] = "on"
-        logger.info("FAST_SYNCH forced ON")
-    elif args.fast_synch is False:
-        os.environ["EXO_FAST_SYNCH"] = "off"
-        logger.info("FAST_SYNCH forced OFF")
-
    node = anyio.run(Node.create, args)
    anyio.run(node.run)
    logger.info("EXO Shutdown complete")
@@ -225,8 +209,6 @@ class Args(CamelCaseModel):
    spawn_api: bool = False
    api_port: PositiveInt = 52415
    tb_only: bool = False
-    no_worker: bool = False
-    fast_synch: bool | None = None  # None = auto, True = force on, False = force off

    @classmethod
    def parse(cls) -> Self:
@@ -264,24 +246,6 @@ class Args(CamelCaseModel):
            dest="api_port",
            default=52415,
        )
-        parser.add_argument(
-            "--no-worker",
-            action="store_true",
-        )
-        fast_synch_group = parser.add_mutually_exclusive_group()
-        fast_synch_group.add_argument(
-            "--fast-synch",
-            action="store_true",
-            dest="fast_synch",
-            default=None,
-            help="Force MLX FAST_SYNCH on (for JACCL backend)",
-        )
-        fast_synch_group.add_argument(
-            "--no-fast-synch",
-            action="store_false",
-            dest="fast_synch",
-            help="Force MLX FAST_SYNCH off",
-        )

        args = parser.parse_args()
        return cls(**vars(args))  # pyright: ignore[reportAny] - We are intentionally validating here, we can't do it statically
--- a/src/exo/master/adapters/init.py
+++ b/src/exo/master/adapters/init.py
@@ -1 +0,0 @@
-"""API adapters for different API formats (Claude, OpenAI Responses, etc.)."""
--- a/src/exo/master/adapters/chat_completions.py
+++ b/src/exo/master/adapters/chat_completions.py
@@ -1,175 +0,0 @@
-"""OpenAI Chat Completions API adapter for converting requests/responses."""
-
-import time
-from collections.abc import AsyncGenerator
-
-from exo.shared.types.api import (
-    ChatCompletionChoice,
-    ChatCompletionMessage,
-    ChatCompletionMessageText,
-    ChatCompletionResponse,
-    ChatCompletionTaskParams,
-    ErrorInfo,
-    ErrorResponse,
-    FinishReason,
-    Logprobs,
-    LogprobsContentItem,
-    StreamingChoiceResponse,
-)
-from exo.shared.types.chunks import TokenChunk
-from exo.shared.types.common import CommandId
-from exo.shared.types.openai_responses import ResponseInputMessage, ResponsesRequest
-
-
-def chat_request_to_internal(request: ChatCompletionTaskParams) -> ResponsesRequest:
-    """Convert Chat Completions API request to ResponsesRequest (canonical internal format).
-
-    Extracts system message as instructions, converts messages to input.
-    """
-    instructions: str | None = None
-    input_messages: list[ResponseInputMessage] = []
-
-    for msg in request.messages:
-        # Normalize content to string
-        content: str
-        if msg.content is None:
-            content = ""
-        elif isinstance(msg.content, str):
-            content = msg.content
-        elif isinstance(msg.content, ChatCompletionMessageText):
-            content = msg.content.text
-        else:
-            # List of ChatCompletionMessageText
-            content = "\n".join(item.text for item in msg.content)
-
-        # Extract system message as instructions
-        if msg.role == "system":
-            if instructions is None:
-                instructions = content
-            else:
-                # Append additional system messages
-                instructions = f"{instructions}\n{content}"
-        else:
-            # Convert to ResponseInputMessage (only user, assistant, developer roles)
-            if msg.role in ("user", "assistant", "developer"):
-                input_messages.append(
-                    ResponseInputMessage(role=msg.role, content=content)
-                )
-
-    return ResponsesRequest(
-        model=request.model,
-        input=input_messages if input_messages else "",
-        instructions=instructions,
-        max_output_tokens=request.max_tokens,
-        temperature=request.temperature,
-        top_p=request.top_p,
-        top_k=request.top_k,
-        stop=request.stop,
-        seed=request.seed,
-        stream=request.stream,
-        tools=request.tools,
-        continue_from_prefix=request.continue_from_prefix,
-    )
-
-
-def chunk_to_response(
-    chunk: TokenChunk, command_id: CommandId
-) -> ChatCompletionResponse:
-    """Convert a TokenChunk to a streaming ChatCompletionResponse."""
-    # Build logprobs if available
-    logprobs: Logprobs | None = None
-    if chunk.logprob is not None:
-        logprobs = Logprobs(
-            content=[
-                LogprobsContentItem(
-                    token=chunk.text,
-                    logprob=chunk.logprob,
-                    top_logprobs=chunk.top_logprobs or [],
-                )
-            ]
-        )
-
-    return ChatCompletionResponse(
-        id=command_id,
-        created=int(time.time()),
-        model=chunk.model,
-        choices=[
-            StreamingChoiceResponse(
-                index=0,
-                delta=ChatCompletionMessage(role="assistant", content=chunk.text),
-                logprobs=logprobs,
-                finish_reason=chunk.finish_reason,
-            )
-        ],
-    )
-
-
-async def generate_chat_stream(
-    command_id: CommandId,
-    chunk_stream: AsyncGenerator[TokenChunk, None],
-) -> AsyncGenerator[str, None]:
-    """Generate Chat Completions API streaming events from TokenChunks."""
-    async for chunk in chunk_stream:
-        if chunk.finish_reason == "error":
-            error_response = ErrorResponse(
-                error=ErrorInfo(
-                    message=chunk.error_message or "Internal server error",
-                    type="InternalServerError",
-                    code=500,
-                )
-            )
-            yield f"data: {error_response.model_dump_json()}\n\n"
-            yield "data: [DONE]\n\n"
-            return
-
-        chunk_response = chunk_to_response(chunk, command_id)
-        yield f"data: {chunk_response.model_dump_json()}\n\n"
-
-        if chunk.finish_reason is not None:
-            yield "data: [DONE]\n\n"
-
-
-async def collect_chat_response(
-    command_id: CommandId,
-    chunk_stream: AsyncGenerator[TokenChunk, None],
-) -> ChatCompletionResponse:
-    """Collect all token chunks and return a single ChatCompletionResponse."""
-    text_parts: list[str] = []
-    model: str | None = None
-    finish_reason: FinishReason | None = None
-    error_message: str | None = None
-
-    async for chunk in chunk_stream:
-        if chunk.finish_reason == "error":
-            error_message = chunk.error_message or "Internal server error"
-            break
-
-        if model is None:
-            model = chunk.model
-
-        text_parts.append(chunk.text)
-
-        if chunk.finish_reason is not None:
-            finish_reason = chunk.finish_reason
-
-    if error_message is not None:
-        raise ValueError(error_message)
-
-    combined_text = "".join(text_parts)
-    assert model is not None
-
-    return ChatCompletionResponse(
-        id=command_id,
-        created=int(time.time()),
-        model=model,
-        choices=[
-            ChatCompletionChoice(
-                index=0,
-                message=ChatCompletionMessage(
-                    role="assistant",
-                    content=combined_text,
-                ),
-                finish_reason=finish_reason,
-            )
-        ],
-    )
--- a/src/exo/master/adapters/claude.py
+++ b/src/exo/master/adapters/claude.py
@@ -1,190 +0,0 @@
-"""Claude Messages API adapter for converting requests/responses."""
-
-from collections.abc import AsyncGenerator
-
-from exo.shared.types.api import FinishReason
-from exo.shared.types.chunks import TokenChunk
-from exo.shared.types.claude_api import (
-    ClaudeContentBlockDeltaEvent,
-    ClaudeContentBlockStartEvent,
-    ClaudeContentBlockStopEvent,
-    ClaudeMessageDelta,
-    ClaudeMessageDeltaEvent,
-    ClaudeMessageDeltaUsage,
-    ClaudeMessagesRequest,
-    ClaudeMessagesResponse,
-    ClaudeMessageStart,
-    ClaudeMessageStartEvent,
-    ClaudeMessageStopEvent,
-    ClaudeStopReason,
-    ClaudeTextBlock,
-    ClaudeTextDelta,
-    ClaudeUsage,
-)
-from exo.shared.types.common import CommandId
-from exo.shared.types.openai_responses import ResponseInputMessage, ResponsesRequest
-
-
-def finish_reason_to_claude_stop_reason(
-    finish_reason: FinishReason | None,
-) -> ClaudeStopReason | None:
-    """Map OpenAI finish_reason to Claude stop_reason."""
-    if finish_reason is None:
-        return None
-    mapping: dict[FinishReason, ClaudeStopReason] = {
-        "stop": "end_turn",
-        "length": "max_tokens",
-        "tool_calls": "tool_use",
-        "content_filter": "end_turn",
-        "function_call": "tool_use",
-    }
-    return mapping.get(finish_reason, "end_turn")
-
-
-def claude_request_to_internal(request: ClaudeMessagesRequest) -> ResponsesRequest:
-    """Convert Claude Messages API request to ResponsesRequest (canonical internal format).
-
-    Converts Claude's system parameter to instructions,
-    and messages to input.
-    """
-    # Handle system message
-    instructions: str | None = None
-    if request.system:
-        if isinstance(request.system, str):
-            instructions = request.system
-        else:
-            # List of text blocks
-            instructions = "".join(block.text for block in request.system)
-
-    # Convert messages to input
-    input_messages: list[ResponseInputMessage] = []
-    for msg in request.messages:
-        content: str
-        if isinstance(msg.content, str):
-            content = msg.content
-        else:
-            # Concatenate text blocks (images not supported for MVP)
-            text_parts: list[str] = []
-            for block in msg.content:
-                if isinstance(block, ClaudeTextBlock):
-                    text_parts.append(block.text)
-            content = "".join(text_parts)
-
-        # Claude uses "user" and "assistant" roles
-        input_messages.append(ResponseInputMessage(role=msg.role, content=content))
-
-    return ResponsesRequest(
-        model=request.model,
-        input=input_messages if input_messages else "",
-        instructions=instructions,
-        max_output_tokens=request.max_tokens,
-        temperature=request.temperature,
-        top_p=request.top_p,
-        top_k=request.top_k,
-        stop=request.stop_sequences,
-        stream=request.stream,
-    )
-
-
-async def collect_claude_response(
-    command_id: CommandId,
-    model: str,
-    chunk_stream: AsyncGenerator[TokenChunk, None],
-) -> ClaudeMessagesResponse:
-    """Collect all token chunks and return a single ClaudeMessagesResponse."""
-    text_parts: list[str] = []
-    stop_reason: ClaudeStopReason | None = None
-    last_stats = None
-    error_message: str | None = None
-
-    async for chunk in chunk_stream:
-        if chunk.finish_reason == "error":
-            error_message = chunk.error_message or "Internal server error"
-            break
-
-        text_parts.append(chunk.text)
-        last_stats = chunk.stats or last_stats
-
-        if chunk.finish_reason is not None:
-            stop_reason = finish_reason_to_claude_stop_reason(chunk.finish_reason)
-
-    if error_message is not None:
-        raise ValueError(error_message)
-
-    combined_text = "".join(text_parts)
-
-    # Use actual usage data from stats if available
-    input_tokens = last_stats.prompt_tokens if last_stats else 0
-    output_tokens = last_stats.generation_tokens if last_stats else 0
-
-    return ClaudeMessagesResponse(
-        id=f"msg_{command_id}",
-        model=model,
-        content=[ClaudeTextBlock(text=combined_text)],
-        stop_reason=stop_reason,
-        usage=ClaudeUsage(
-            input_tokens=input_tokens,
-            output_tokens=output_tokens,
-        ),
-    )
-
-
-async def generate_claude_stream(
-    command_id: CommandId,
-    model: str,
-    chunk_stream: AsyncGenerator[TokenChunk, None],
-) -> AsyncGenerator[str, None]:
-    """Generate Claude Messages API streaming events from TokenChunks."""
-    # Initial message_start event
-    initial_message = ClaudeMessageStart(
-        id=f"msg_{command_id}",
-        model=model,
-        content=[],
-        stop_reason=None,
-        usage=ClaudeUsage(input_tokens=0, output_tokens=0),
-    )
-    start_event = ClaudeMessageStartEvent(message=initial_message)
-    yield f"event: message_start\ndata: {start_event.model_dump_json()}\n\n"
-
-    # content_block_start
-    block_start = ClaudeContentBlockStartEvent(
-        index=0, content_block=ClaudeTextBlock(text="")
-    )
-    yield f"event: content_block_start\ndata: {block_start.model_dump_json()}\n\n"
-
-    output_tokens = 0
-    stop_reason: ClaudeStopReason | None = None
-    last_stats = None
-
-    async for chunk in chunk_stream:
-        output_tokens += 1  # Count each chunk as one token
-        last_stats = chunk.stats or last_stats
-
-        # content_block_delta
-        delta_event = ClaudeContentBlockDeltaEvent(
-            index=0,
-            delta=ClaudeTextDelta(text=chunk.text),
-        )
-        yield f"event: content_block_delta\ndata: {delta_event.model_dump_json()}\n\n"
-
-        if chunk.finish_reason is not None:
-            stop_reason = finish_reason_to_claude_stop_reason(chunk.finish_reason)
-
-    # Use actual token count from stats if available
-    if last_stats is not None:
-        output_tokens = last_stats.generation_tokens
-
-    # content_block_stop
-    block_stop = ClaudeContentBlockStopEvent(index=0)
-    yield f"event: content_block_stop\ndata: {block_stop.model_dump_json()}\n\n"
-
-    # message_delta
-    message_delta = ClaudeMessageDeltaEvent(
-        delta=ClaudeMessageDelta(stop_reason=stop_reason),
-        usage=ClaudeMessageDeltaUsage(output_tokens=output_tokens),
-    )
-    yield f"event: message_delta\ndata: {message_delta.model_dump_json()}\n\n"
-
-    # message_stop
-    message_stop = ClaudeMessageStopEvent()
-    yield f"event: message_stop\ndata: {message_stop.model_dump_json()}\n\n"
--- a/src/exo/master/adapters/responses.py
+++ b/src/exo/master/adapters/responses.py
@@ -1,173 +0,0 @@
-"""OpenAI Responses API adapter for converting requests/responses.
-
-ResponsesRequest is the canonical internal format. Responses API is the most featureful,
-making it the natural choice for the internal format. All other API formats (Chat
-Completions, Claude) are converted TO ResponsesRequest.
-"""
-
-from collections.abc import AsyncGenerator
-
-from exo.shared.types.chunks import TokenChunk
-from exo.shared.types.common import CommandId
-from exo.shared.types.openai_responses import (
-    ResponseCompletedEvent,
-    ResponseContentPartAddedEvent,
-    ResponseContentPartDoneEvent,
-    ResponseCreatedEvent,
-    ResponseInProgressEvent,
-    ResponseMessageItem,
-    ResponseOutputItemAddedEvent,
-    ResponseOutputItemDoneEvent,
-    ResponseOutputText,
-    ResponsesResponse,
-    ResponseTextDeltaEvent,
-    ResponseTextDoneEvent,
-    ResponseUsage,
-)
-
-
-async def collect_responses_response(
-    command_id: CommandId,
-    model: str,
-    chunk_stream: AsyncGenerator[TokenChunk, None],
-) -> ResponsesResponse:
-    """Collect all token chunks and return a single ResponsesResponse."""
-    response_id = f"resp_{command_id}"
-    item_id = f"item_{command_id}"
-    accumulated_text = ""
-    last_stats = None
-    error_message: str | None = None
-
-    async for chunk in chunk_stream:
-        if chunk.finish_reason == "error":
-            error_message = chunk.error_message or "Internal server error"
-            break
-
-        accumulated_text += chunk.text
-        last_stats = chunk.stats or last_stats
-
-    if error_message is not None:
-        raise ValueError(error_message)
-
-    # Create usage from stats if available
-    usage = None
-    if last_stats is not None:
-        usage = ResponseUsage(
-            input_tokens=last_stats.prompt_tokens,
-            output_tokens=last_stats.generation_tokens,
-            total_tokens=last_stats.prompt_tokens + last_stats.generation_tokens,
-        )
-
-    output_item = ResponseMessageItem(
-        id=item_id,
-        content=[ResponseOutputText(text=accumulated_text)],
-        status="completed",
-    )
-
-    return ResponsesResponse(
-        id=response_id,
-        model=model,
-        status="completed",
-        output=[output_item],
-        output_text=accumulated_text,
-        usage=usage,
-    )
-
-
-async def generate_responses_stream(
-    command_id: CommandId,
-    model: str,
-    chunk_stream: AsyncGenerator[TokenChunk, None],
-) -> AsyncGenerator[str, None]:
-    """Generate OpenAI Responses API streaming events from TokenChunks."""
-    response_id = f"resp_{command_id}"
-    item_id = f"item_{command_id}"
-
-    # response.created
-    initial_response = ResponsesResponse(
-        id=response_id,
-        model=model,
-        status="in_progress",
-        output=[],
-        output_text="",
-    )
-    created_event = ResponseCreatedEvent(response=initial_response)
-    yield f"event: response.created\ndata: {created_event.model_dump_json()}\n\n"
-
-    # response.in_progress
-    in_progress_event = ResponseInProgressEvent(response=initial_response)
-    yield f"event: response.in_progress\ndata: {in_progress_event.model_dump_json()}\n\n"
-
-    # response.output_item.added
-    initial_item = ResponseMessageItem(
-        id=item_id,
-        content=[ResponseOutputText(text="")],
-        status="in_progress",
-    )
-    item_added = ResponseOutputItemAddedEvent(output_index=0, item=initial_item)
-    yield f"event: response.output_item.added\ndata: {item_added.model_dump_json()}\n\n"
-
-    # response.content_part.added
-    initial_part = ResponseOutputText(text="")
-    part_added = ResponseContentPartAddedEvent(
-        output_index=0, content_index=0, part=initial_part
-    )
-    yield f"event: response.content_part.added\ndata: {part_added.model_dump_json()}\n\n"
-
-    accumulated_text = ""
-    last_stats = None
-
-    async for chunk in chunk_stream:
-        accumulated_text += chunk.text
-        last_stats = chunk.stats or last_stats
-
-        # response.output_text.delta
-        delta_event = ResponseTextDeltaEvent(
-            output_index=0,
-            content_index=0,
-            delta=chunk.text,
-        )
-        yield f"event: response.output_text.delta\ndata: {delta_event.model_dump_json()}\n\n"
-
-    # response.output_text.done
-    text_done = ResponseTextDoneEvent(
-        output_index=0, content_index=0, text=accumulated_text
-    )
-    yield f"event: response.output_text.done\ndata: {text_done.model_dump_json()}\n\n"
-
-    # response.content_part.done
-    final_part = ResponseOutputText(text=accumulated_text)
-    part_done = ResponseContentPartDoneEvent(
-        output_index=0, content_index=0, part=final_part
-    )
-    yield f"event: response.content_part.done\ndata: {part_done.model_dump_json()}\n\n"
-
-    # response.output_item.done
-    final_item = ResponseMessageItem(
-        id=item_id,
-        content=[ResponseOutputText(text=accumulated_text)],
-        status="completed",
-    )
-    item_done = ResponseOutputItemDoneEvent(output_index=0, item=final_item)
-    yield f"event: response.output_item.done\ndata: {item_done.model_dump_json()}\n\n"
-
-    # Create usage from stats if available
-    usage = None
-    if last_stats is not None:
-        usage = ResponseUsage(
-            input_tokens=last_stats.prompt_tokens,
-            output_tokens=last_stats.generation_tokens,
-            total_tokens=last_stats.prompt_tokens + last_stats.generation_tokens,
-        )
-
-    # response.completed
-    final_response = ResponsesResponse(
-        id=response_id,
-        model=model,
-        status="completed",
-        output=[final_item],
-        output_text=accumulated_text,
-        usage=usage,
-    )
-    completed_event = ResponseCompletedEvent(response=final_response)
-    yield f"event: response.completed\ndata: {completed_event.model_dump_json()}\n\n"
--- a/src/exo/master/api.py
+++ b/src/exo/master/api.py
@@ -1,33 +1,25 @@
+import time
 from collections.abc import AsyncGenerator
-from http import HTTPStatus
 from typing import cast

 import anyio
-from anyio import BrokenResourceError, create_task_group
+from anyio import create_task_group
 from anyio.abc import TaskGroup
-from fastapi import FastAPI, HTTPException, Request
+from fastapi import FastAPI, HTTPException
 from fastapi.middleware.cors import CORSMiddleware
-from fastapi.responses import JSONResponse, StreamingResponse
+from fastapi.responses import StreamingResponse
 from fastapi.staticfiles import StaticFiles
 from hypercorn.asyncio import serve  # pyright: ignore[reportUnknownVariableType]
 from hypercorn.config import Config
 from hypercorn.typing import ASGIFramework
 from loguru import logger
+from openai_harmony import (  # pyright: ignore[reportMissingTypeStubs]
+    HarmonyEncodingName,
+    Role,
+    StreamableParser,
+    load_harmony_encoding,
+)

-from exo.master.adapters.chat_completions import (
-    chat_request_to_internal,
-    collect_chat_response,
-    generate_chat_stream,
-)
-from exo.master.adapters.claude import (
-    claude_request_to_internal,
-    collect_claude_response,
-    generate_claude_stream,
-)
-from exo.master.adapters.responses import (
-    collect_responses_response,
-    generate_responses_stream,
-)
 from exo.master.placement import place_instance as get_instance_placements
 from exo.shared.apply import apply
 from exo.shared.election import ElectionMessage
@@ -35,29 +27,21 @@ from exo.shared.logging import InterceptLogger
 from exo.shared.models.model_cards import MODEL_CARDS
 from exo.shared.models.model_meta import get_model_meta
 from exo.shared.types.api import (
-    BenchChatCompletionResponse,
-    BenchChatCompletionTaskParams,
    ChatCompletionChoice,
    ChatCompletionMessage,
    ChatCompletionResponse,
-    ChatCompletionTaskParams,
    CreateInstanceParams,
    CreateInstanceResponse,
    DeleteInstanceResponse,
-    ErrorInfo,
-    ErrorResponse,
-    GenerationStats,
+    FinishReason,
    ModelList,
    ModelListModel,
    PlaceInstanceParams,
    PlacementPreview,
    PlacementPreviewResponse,
+    StreamingChoiceResponse,
 )
 from exo.shared.types.chunks import TokenChunk
-from exo.shared.types.claude_api import (
-    ClaudeMessagesRequest,
-    ClaudeMessagesResponse,
-)
 from exo.shared.types.commands import (
    ChatCompletion,
    Command,
@@ -68,19 +52,11 @@ from exo.shared.types.commands import (
    TaskFinished,
 )
 from exo.shared.types.common import CommandId, NodeId, SessionId
-from exo.shared.types.events import (
-    ChunkGenerated,
-    Event,
-    ForwarderEvent,
-    IndexedEvent,
-)
+from exo.shared.types.events import ChunkGenerated, Event, ForwarderEvent, IndexedEvent
 from exo.shared.types.memory import Memory
 from exo.shared.types.models import ModelId, ModelMetadata
-from exo.shared.types.openai_responses import (
-    ResponsesRequest,
-    ResponsesResponse,
-)
 from exo.shared.types.state import State
+from exo.shared.types.tasks import ChatCompletionTaskParams
 from exo.shared.types.worker.instances import Instance, InstanceId, InstanceMeta
 from exo.shared.types.worker.shards import Sharding
 from exo.utils.banner import print_startup_banner
@@ -88,6 +64,25 @@ from exo.utils.channels import Receiver, Sender, channel
 from exo.utils.dashboard_path import find_dashboard
 from exo.utils.event_buffer import OrderedBuffer

+encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
+
+
+def chunk_to_response(
+    chunk: TokenChunk, command_id: CommandId
+) -> ChatCompletionResponse:
+    return ChatCompletionResponse(
+        id=command_id,
+        created=int(time.time()),
+        model=chunk.model,
+        choices=[
+            StreamingChoiceResponse(
+                index=0,
+                delta=ChatCompletionMessage(role="assistant", content=chunk.text),
+                finish_reason=chunk.finish_reason,
+            )
+        ],
+    )
+

 async def resolve_model_meta(model_id: str) -> ModelMetadata:
    if model_id in MODEL_CARDS:
@@ -125,7 +120,6 @@ class API:
        self.paused_ev: anyio.Event = anyio.Event()

        self.app = FastAPI()
-        self._setup_exception_handlers()
        self._setup_cors()
        self._setup_routes()

@@ -156,20 +150,6 @@ class API:
        self.paused_ev.set()
        self.paused_ev = anyio.Event()

-    def _setup_exception_handlers(self) -> None:
-        @self.app.exception_handler(HTTPException)
-        async def http_exception_handler(  # pyright: ignore[reportUnusedFunction]
-            _: Request, exc: HTTPException
-        ) -> JSONResponse:
-            err = ErrorResponse(
-                error=ErrorInfo(
-                    message=exc.detail,
-                    type=HTTPStatus(exc.status_code).phrase,
-                    code=exc.status_code,
-                )
-            )
-            return JSONResponse(err.model_dump(), status_code=exc.status_code)
-
    def _setup_cors(self) -> None:
        self.app.add_middleware(
            CORSMiddleware,
@@ -192,9 +172,6 @@ class API:
        self.app.post("/v1/chat/completions", response_model=None)(
            self.chat_completions
        )
-        self.app.post("/bench/chat/completions")(self.bench_chat_completions)
-        self.app.post("/v1/messages", response_model=None)(self.claude_messages)
-        self.app.post("/v1/responses", response_model=None)(self.openai_responses)
        self.app.get("/state")(lambda: self.state)
        self.app.get("/events")(lambda: self._event_log)

@@ -400,21 +377,52 @@ class API:
            instance_id=instance_id,
        )

-    async def _token_chunk_stream(
-        self, command_id: CommandId
-    ) -> AsyncGenerator[TokenChunk, None]:
-        """Yield `TokenChunk`s for a given command until completion.
+    async def _process_gpt_oss(self, token_chunks: Receiver[TokenChunk]):
+        stream = StreamableParser(encoding, role=Role.ASSISTANT)
+        thinking = False
+
+        async for chunk in token_chunks:
+            stream.process(chunk.token_id)
+
+            delta = stream.last_content_delta
+            ch = stream.current_channel
+
+            if ch == "analysis" and not thinking:
+                thinking = True
+                yield chunk.model_copy(update={"text": "<think>"})
+
+            if ch != "analysis" and thinking:
+                thinking = False
+                yield chunk.model_copy(update={"text": "</think>"})
+
+            if delta:
+                yield chunk.model_copy(update={"text": delta})
+
+            if chunk.finish_reason is not None:
+                if thinking:
+                    yield chunk.model_copy(update={"text": "</think>"})
+                yield chunk
+                break
+
+    async def _chat_chunk_stream(
+        self, command_id: CommandId, parse_gpt_oss: bool
+    ) -> AsyncGenerator[TokenChunk, None]:
+        """Yield `TokenChunk`s for a given command until completion."""

-        This is the internal low-level stream used by all API adapters.
-        """
        try:
            self._chat_completion_queues[command_id], recv = channel[TokenChunk]()

            with recv as token_chunks:
-                async for chunk in token_chunks:
-                    yield chunk
-                    if chunk.finish_reason is not None:
-                        break
+                if parse_gpt_oss:
+                    async for chunk in self._process_gpt_oss(token_chunks):
+                        yield chunk
+                        if chunk.finish_reason is not None:
+                            break
+                else:
+                    async for chunk in token_chunks:
+                        yield chunk
+                        if chunk.finish_reason is not None:
+                            break

        except anyio.get_cancelled_exc_class():
            # TODO: TaskCancelled
@@ -429,31 +437,36 @@ class API:
            await self._send(command)
            del self._chat_completion_queues[command_id]

-    async def _collect_chat_completion_with_stats(
-        self, command_id: CommandId
-    ) -> BenchChatCompletionResponse:
-        import time
+    async def _generate_chat_stream(
+        self, command_id: CommandId, parse_gpt_oss: bool
+    ) -> AsyncGenerator[str, None]:
+        """Generate chat completion stream as JSON strings."""

-        from exo.shared.types.api import FinishReason
+        async for chunk in self._chat_chunk_stream(command_id, parse_gpt_oss):
+            chunk_response: ChatCompletionResponse = chunk_to_response(
+                chunk, command_id
+            )
+            logger.debug(f"chunk_response: {chunk_response}")
+
+            yield f"data: {chunk_response.model_dump_json()}\n\n"
+
+            if chunk.finish_reason is not None:
+                yield "data: [DONE]\n\n"
+
+    async def _collect_chat_completion(
+        self, command_id: CommandId, parse_gpt_oss: bool
+    ) -> ChatCompletionResponse:
+        """Collect all token chunks for a chat completion and return a single response."""

        text_parts: list[str] = []
        model: str | None = None
        finish_reason: FinishReason | None = None

-        stats: GenerationStats | None = None
-
-        async for chunk in self._token_chunk_stream(command_id):
-            if chunk.finish_reason == "error":
-                raise HTTPException(
-                    status_code=500,
-                    detail=chunk.error_message or "Internal server error",
-                )
-
+        async for chunk in self._chat_chunk_stream(command_id, parse_gpt_oss):
            if model is None:
                model = chunk.model

            text_parts.append(chunk.text)
-            stats = chunk.stats or stats

            if chunk.finish_reason is not None:
                finish_reason = chunk.finish_reason
@@ -461,7 +474,7 @@ class API:
        combined_text = "".join(text_parts)
        assert model is not None

-        resp = BenchChatCompletionResponse(
+        return ChatCompletionResponse(
            id=command_id,
            created=int(time.time()),
            model=model,
@@ -469,14 +482,13 @@ class API:
                ChatCompletionChoice(
                    index=0,
                    message=ChatCompletionMessage(
-                        role="assistant", content=combined_text
+                        role="assistant",
+                        content=combined_text,
                    ),
                    finish_reason=finish_reason,
                )
            ],
-            generation_stats=stats,
        )
-        return resp

    async def _trigger_notify_user_to_download_model(self, model_id: str) -> None:
        logger.warning(
@@ -486,146 +498,32 @@ class API:
    async def chat_completions(
        self, payload: ChatCompletionTaskParams
    ) -> ChatCompletionResponse | StreamingResponse:
-        """OpenAI Chat Completions API - adapter."""
-        internal_params = chat_request_to_internal(payload)
-        model_meta = await resolve_model_meta(internal_params.model)
-        internal_params.model = model_meta.model_id
-
-        if not any(
-            instance.shard_assignments.model_id == internal_params.model
-            for instance in self.state.instances.values()
-        ):
-            await self._trigger_notify_user_to_download_model(internal_params.model)
-            raise HTTPException(
-                status_code=404,
-                detail=f"No instance found for model {internal_params.model}",
-            )
-
-        command = ChatCompletion(request_params=internal_params)
-        await self._send(command)
-
-        if payload.stream:
-            return StreamingResponse(
-                generate_chat_stream(
-                    command.command_id,
-                    self._token_chunk_stream(command.command_id),
-                ),
-                media_type="text/event-stream",
-            )
-
-        try:
-            return await collect_chat_response(
-                command.command_id,
-                self._token_chunk_stream(command.command_id),
-            )
-        except ValueError as e:
-            raise HTTPException(status_code=500, detail=str(e)) from e
-
-    async def bench_chat_completions(
-        self, payload: BenchChatCompletionTaskParams
-    ) -> BenchChatCompletionResponse:
-        # Convert to internal format (BenchChatCompletionTaskParams extends ChatCompletionTaskParams)
-        internal_params = chat_request_to_internal(payload)
-        model_meta = await resolve_model_meta(internal_params.model)
-        internal_params.model = model_meta.model_id
-
-        if not any(
-            instance.shard_assignments.model_id == internal_params.model
-            for instance in self.state.instances.values()
-        ):
-            await self._trigger_notify_user_to_download_model(internal_params.model)
-            raise HTTPException(
-                status_code=404,
-                detail=f"No instance found for model {internal_params.model}",
-            )
-
-        internal_params.stream = False
-
-        command = ChatCompletion(request_params=internal_params)
-        await self._send(command)
-
-        response = await self._collect_chat_completion_with_stats(command.command_id)
-        return response
-
-    async def claude_messages(
-        self, payload: ClaudeMessagesRequest
-    ) -> ClaudeMessagesResponse | StreamingResponse:
-        """Claude Messages API - adapter."""
-        internal_params = claude_request_to_internal(payload)
-        model_meta = await resolve_model_meta(internal_params.model)
-        internal_params.model = model_meta.model_id
-
-        if not any(
-            instance.shard_assignments.model_id == internal_params.model
-            for instance in self.state.instances.values()
-        ):
-            await self._trigger_notify_user_to_download_model(internal_params.model)
-            raise HTTPException(
-                status_code=404,
-                detail=f"No instance found for model {internal_params.model}",
-            )
-
-        command = ChatCompletion(request_params=internal_params)
-        await self._send(command)
-
-        if payload.stream:
-            return StreamingResponse(
-                generate_claude_stream(
-                    command.command_id,
-                    payload.model,
-                    self._token_chunk_stream(command.command_id),
-                ),
-                media_type="text/event-stream",
-            )
-
-        try:
-            return await collect_claude_response(
-                command.command_id,
-                payload.model,
-                self._token_chunk_stream(command.command_id),
-            )
-        except ValueError as e:
-            raise HTTPException(status_code=500, detail=str(e)) from e
-
-    async def openai_responses(
-        self, payload: ResponsesRequest
-    ) -> ResponsesResponse | StreamingResponse:
-        """OpenAI Responses API - native format (no conversion needed)."""
+        """Handle chat completions, supporting both streaming and non-streaming responses."""
        model_meta = await resolve_model_meta(payload.model)
-        # Update model to resolved model_id
-        request_params = payload.model_copy(update={"model": model_meta.model_id})
+        payload.model = model_meta.model_id
+        parse_gpt_oss = "gpt-oss" in model_meta.model_id.lower()
+        logger.info(f"{parse_gpt_oss=}")

        if not any(
-            instance.shard_assignments.model_id == request_params.model
+            instance.shard_assignments.model_id == payload.model
            for instance in self.state.instances.values()
        ):
-            await self._trigger_notify_user_to_download_model(request_params.model)
+            await self._trigger_notify_user_to_download_model(payload.model)
            raise HTTPException(
-                status_code=404,
-                detail=f"No instance found for model {request_params.model}",
+                status_code=404, detail=f"No instance found for model {payload.model}"
            )

-        command = ChatCompletion(request_params=request_params)
+        command = ChatCompletion(
+            request_params=payload,
+        )
        await self._send(command)
-
        if payload.stream:
            return StreamingResponse(
-                generate_responses_stream(
-                    command.command_id,
-                    payload.model,
-                    self._token_chunk_stream(command.command_id),
-                ),
+                self._generate_chat_stream(command.command_id, parse_gpt_oss),
                media_type="text/event-stream",
            )

-        try:
-            return await collect_responses_response(
-                command.command_id,
-                payload.model,
-                self._token_chunk_stream(command.command_id),
-            )
-        except ValueError as e:
-            raise HTTPException(status_code=500, detail=str(e)) from e
+        return await self._collect_chat_completion(command.command_id, parse_gpt_oss)

    def _calculate_total_available_memory(self) -> Memory:
        """Calculate total available memory across all nodes in bytes."""
@@ -686,14 +584,14 @@ class API:
                for idx, event in self.event_buffer.drain_indexed():
                    self._event_log.append(event)
                    self.state = apply(self.state, IndexedEvent(event=event, idx=idx))
-                    if isinstance(event, ChunkGenerated):
+                    if (
+                        isinstance(event, ChunkGenerated)
+                        and event.command_id in self._chat_completion_queues
+                    ):
                        assert isinstance(event.chunk, TokenChunk)
-                        queue = self._chat_completion_queues.get(event.command_id)
-                        if queue is not None:
-                            try:
-                                await queue.send(event.chunk)
-                            except BrokenResourceError:
-                                self._chat_completion_queues.pop(event.command_id, None)
+                        await self._chat_completion_queues[event.command_id].send(
+                            event.chunk
+                        )

    async def _pause_on_new_election(self):
        with self.election_receiver as ems:
--- a/src/exo/master/tests/test_api_error_handling.py
+++ b/src/exo/master/tests/test_api_error_handling.py
@@ -1,107 +0,0 @@
-# pyright: reportUnusedFunction=false, reportAny=false
-from typing import Any, get_args
-
-from fastapi import FastAPI, HTTPException
-from fastapi.testclient import TestClient
-
-from exo.shared.types.api import ErrorInfo, ErrorResponse, FinishReason
-from exo.shared.types.chunks import TokenChunk
-from exo.worker.tests.constants import MODEL_A_ID
-
-
-def test_http_exception_handler_formats_openai_style() -> None:
-    """Test that HTTPException is converted to OpenAI-style error format."""
-    from exo.master.api import API
-
-    app = FastAPI()
-
-    # Setup exception handler
-    api = object.__new__(API)
-    api.app = app
-    api._setup_exception_handlers()  # pyright: ignore[reportPrivateUsage]
-
-    # Add test routes that raise HTTPException
-    @app.get("/test-error")
-    async def _test_error() -> None:
-        raise HTTPException(status_code=500, detail="Test error message")
-
-    @app.get("/test-not-found")
-    async def _test_not_found() -> None:
-        raise HTTPException(status_code=404, detail="Resource not found")
-
-    client = TestClient(app)
-
-    # Test 500 error
-    response = client.get("/test-error")
-    assert response.status_code == 500
-    data: dict[str, Any] = response.json()
-    assert "error" in data
-    assert data["error"]["message"] == "Test error message"
-    assert data["error"]["type"] == "Internal Server Error"
-    assert data["error"]["code"] == 500
-
-    # Test 404 error
-    response = client.get("/test-not-found")
-    assert response.status_code == 404
-    data = response.json()
-    assert "error" in data
-    assert data["error"]["message"] == "Resource not found"
-    assert data["error"]["type"] == "Not Found"
-    assert data["error"]["code"] == 404
-
-
-def test_finish_reason_includes_error() -> None:
-    valid_reasons = get_args(FinishReason)
-    assert "error" in valid_reasons
-
-
-def test_token_chunk_with_error_fields() -> None:
-    chunk = TokenChunk(
-        idx=0,
-        model=MODEL_A_ID,
-        text="",
-        token_id=0,
-        finish_reason="error",
-        error_message="Something went wrong",
-    )
-
-    assert chunk.finish_reason == "error"
-    assert chunk.error_message == "Something went wrong"
-
-
-def test_token_chunk_without_error() -> None:
-    chunk = TokenChunk(
-        idx=1,
-        model=MODEL_A_ID,
-        text="Hello",
-        token_id=42,
-        finish_reason=None,
-    )
-
-    assert chunk.finish_reason is None
-    assert chunk.error_message is None
-
-
-def test_error_response_construction() -> None:
-    error_response = ErrorResponse(
-        error=ErrorInfo(
-            message="Generation failed",
-            type="InternalServerError",
-            code=500,
-        )
-    )
-
-    assert error_response.error.message == "Generation failed"
-    assert error_response.error.code == 500
-
-
-def test_normal_finish_reasons_still_work() -> None:
-    for reason in ["stop", "length", "tool_calls", "content_filter", "function_call"]:
-        chunk = TokenChunk(
-            idx=0,
-            model=MODEL_A_ID,
-            text="done",
-            token_id=100,
-            finish_reason=reason,  # type: ignore[arg-type]
-        )
-        assert chunk.finish_reason == reason
--- a/src/exo/master/tests/test_claude_api.py
+++ b/src/exo/master/tests/test_claude_api.py
@@ -1,283 +0,0 @@
-"""Tests for Claude Messages API conversion functions and types."""
-
-import json
-from typing import Any, cast
-
-import pydantic
-import pytest
-
-from exo.master.adapters.claude import (
-    claude_request_to_internal,
-    finish_reason_to_claude_stop_reason,
-)
-from exo.shared.types.claude_api import (
-    ClaudeContentBlockDeltaEvent,
-    ClaudeContentBlockStartEvent,
-    ClaudeContentBlockStopEvent,
-    ClaudeMessage,
-    ClaudeMessageDelta,
-    ClaudeMessageDeltaEvent,
-    ClaudeMessageDeltaUsage,
-    ClaudeMessagesRequest,
-    ClaudeMessageStart,
-    ClaudeMessageStartEvent,
-    ClaudeMessageStopEvent,
-    ClaudeTextBlock,
-    ClaudeTextDelta,
-    ClaudeUsage,
-)
-
-
-class TestFinishReasonToClaudeStopReason:
-    """Tests for finish_reason to Claude stop_reason mapping."""
-
-    def test_stop_maps_to_end_turn(self):
-        assert finish_reason_to_claude_stop_reason("stop") == "end_turn"
-
-    def test_length_maps_to_max_tokens(self):
-        assert finish_reason_to_claude_stop_reason("length") == "max_tokens"
-
-    def test_tool_calls_maps_to_tool_use(self):
-        assert finish_reason_to_claude_stop_reason("tool_calls") == "tool_use"
-
-    def test_function_call_maps_to_tool_use(self):
-        assert finish_reason_to_claude_stop_reason("function_call") == "tool_use"
-
-    def test_content_filter_maps_to_end_turn(self):
-        assert finish_reason_to_claude_stop_reason("content_filter") == "end_turn"
-
-    def test_none_returns_none(self):
-        assert finish_reason_to_claude_stop_reason(None) is None
-
-
-class TestClaudeRequestToInternal:
-    """Tests for converting Claude Messages API requests to ResponsesRequest."""
-
-    def test_basic_request_conversion(self):
-        request = ClaudeMessagesRequest(
-            model="claude-3-opus",
-            max_tokens=100,
-            messages=[
-                ClaudeMessage(role="user", content="Hello"),
-            ],
-        )
-        params = claude_request_to_internal(request)
-
-        assert params.model == "claude-3-opus"
-        assert params.max_output_tokens == 100
-        assert isinstance(params.input, list)
-        assert len(params.input) == 1
-        assert params.input[0].role == "user"
-        assert params.input[0].content == "Hello"
-        assert params.instructions is None
-
-    def test_request_with_system_string(self):
-        request = ClaudeMessagesRequest(
-            model="claude-3-opus",
-            max_tokens=100,
-            system="You are a helpful assistant.",
-            messages=[
-                ClaudeMessage(role="user", content="Hello"),
-            ],
-        )
-        params = claude_request_to_internal(request)
-
-        assert params.instructions == "You are a helpful assistant."
-        assert isinstance(params.input, list)
-        assert len(params.input) == 1
-        assert params.input[0].role == "user"
-        assert params.input[0].content == "Hello"
-
-    def test_request_with_system_text_blocks(self):
-        request = ClaudeMessagesRequest(
-            model="claude-3-opus",
-            max_tokens=100,
-            system=[
-                ClaudeTextBlock(text="You are helpful. "),
-                ClaudeTextBlock(text="Be concise."),
-            ],
-            messages=[
-                ClaudeMessage(role="user", content="Hello"),
-            ],
-        )
-        params = claude_request_to_internal(request)
-
-        assert params.instructions == "You are helpful. Be concise."
-        assert isinstance(params.input, list)
-        assert len(params.input) == 1
-
-    def test_request_with_content_blocks(self):
-        request = ClaudeMessagesRequest(
-            model="claude-3-opus",
-            max_tokens=100,
-            messages=[
-                ClaudeMessage(
-                    role="user",
-                    content=[
-                        ClaudeTextBlock(text="First part. "),
-                        ClaudeTextBlock(text="Second part."),
-                    ],
-                ),
-            ],
-        )
-        params = claude_request_to_internal(request)
-
-        assert isinstance(params.input, list)
-        assert len(params.input) == 1
-        assert params.input[0].content == "First part. Second part."
-
-    def test_request_with_multi_turn_conversation(self):
-        request = ClaudeMessagesRequest(
-            model="claude-3-opus",
-            max_tokens=100,
-            messages=[
-                ClaudeMessage(role="user", content="Hello"),
-                ClaudeMessage(role="assistant", content="Hi there!"),
-                ClaudeMessage(role="user", content="How are you?"),
-            ],
-        )
-        params = claude_request_to_internal(request)
-
-        assert isinstance(params.input, list)
-        assert len(params.input) == 3
-        assert params.input[0].role == "user"
-        assert params.input[1].role == "assistant"
-        assert params.input[2].role == "user"
-
-    def test_request_with_optional_parameters(self):
-        request = ClaudeMessagesRequest(
-            model="claude-3-opus",
-            max_tokens=100,
-            messages=[ClaudeMessage(role="user", content="Hello")],
-            temperature=0.7,
-            top_p=0.9,
-            top_k=40,
-            stop_sequences=["STOP", "END"],
-            stream=True,
-        )
-        params = claude_request_to_internal(request)
-
-        assert params.temperature == 0.7
-        assert params.top_p == 0.9
-        assert params.top_k == 40
-        assert params.stop == ["STOP", "END"]
-        assert params.stream is True
-
-
-class TestClaudeMessagesRequestValidation:
-    """Tests for Claude Messages API request validation."""
-
-    def test_request_requires_model(self):
-        with pytest.raises(pydantic.ValidationError):
-            ClaudeMessagesRequest.model_validate(
-                {
-                    "max_tokens": 100,
-                    "messages": [{"role": "user", "content": "Hello"}],
-                }
-            )
-
-    def test_request_requires_max_tokens(self):
-        with pytest.raises(pydantic.ValidationError):
-            ClaudeMessagesRequest.model_validate(
-                {
-                    "model": "claude-3-opus",
-                    "messages": [{"role": "user", "content": "Hello"}],
-                }
-            )
-
-    def test_request_requires_messages(self):
-        with pytest.raises(pydantic.ValidationError):
-            ClaudeMessagesRequest.model_validate(
-                {
-                    "model": "claude-3-opus",
-                    "max_tokens": 100,
-                }
-            )
-
-
-class TestClaudeStreamingEvents:
-    """Tests for Claude Messages API streaming event serialization."""
-
-    def test_message_start_event_format(self):
-        message = ClaudeMessageStart(
-            id="msg_123",
-            model="claude-3-opus",
-            content=[],
-            stop_reason=None,
-            usage=ClaudeUsage(input_tokens=10, output_tokens=0),
-        )
-        event = ClaudeMessageStartEvent(message=message)
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "message_start"
-        assert parsed["message"]["id"] == "msg_123"
-        assert parsed["message"]["type"] == "message"
-        assert parsed["message"]["role"] == "assistant"
-        assert parsed["message"]["model"] == "claude-3-opus"
-
-    def test_content_block_start_event_format(self):
-        event = ClaudeContentBlockStartEvent(
-            index=0,
-            content_block=ClaudeTextBlock(text=""),
-        )
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "content_block_start"
-        assert parsed["index"] == 0
-        assert parsed["content_block"]["type"] == "text"
-        assert parsed["content_block"]["text"] == ""
-
-    def test_content_block_delta_event_format(self):
-        event = ClaudeContentBlockDeltaEvent(
-            index=0,
-            delta=ClaudeTextDelta(text="Hello"),
-        )
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "content_block_delta"
-        assert parsed["index"] == 0
-        assert parsed["delta"]["type"] == "text_delta"
-        assert parsed["delta"]["text"] == "Hello"
-
-    def test_content_block_stop_event_format(self):
-        event = ClaudeContentBlockStopEvent(index=0)
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "content_block_stop"
-        assert parsed["index"] == 0
-
-    def test_message_delta_event_format(self):
-        event = ClaudeMessageDeltaEvent(
-            delta=ClaudeMessageDelta(stop_reason="end_turn"),
-            usage=ClaudeMessageDeltaUsage(output_tokens=25),
-        )
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "message_delta"
-        assert parsed["delta"]["stop_reason"] == "end_turn"
-        assert parsed["usage"]["output_tokens"] == 25
-
-    def test_message_stop_event_format(self):
-        event = ClaudeMessageStopEvent()
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "message_stop"
-
-    def test_sse_format(self):
-        """Test that SSE format is correctly generated."""
-        event = ClaudeContentBlockDeltaEvent(
-            index=0,
-            delta=ClaudeTextDelta(text="Hello"),
-        )
-        # Simulate the SSE format used in the streaming generator
-        sse_line = f"event: content_block_delta\ndata: {event.model_dump_json()}\n\n"
-
-        assert sse_line.startswith("event: content_block_delta\n")
-        assert "data: " in sse_line
-        assert sse_line.endswith("\n\n")
--- a/src/exo/master/tests/test_master.py
+++ b/src/exo/master/tests/test_master.py
@@ -7,6 +7,7 @@ from loguru import logger

 from exo.master.main import Master
 from exo.routing.router import get_node_id_keypair
+from exo.shared.types.api import ChatCompletionMessage, ChatCompletionTaskParams
 from exo.shared.types.commands import (
    ChatCompletion,
    CommandId,
@@ -23,7 +24,6 @@ from exo.shared.types.events import (
 )
 from exo.shared.types.memory import Memory
 from exo.shared.types.models import ModelId, ModelMetadata
-from exo.shared.types.openai_responses import ResponsesRequest
 from exo.shared.types.profiling import (
    MemoryPerformanceProfile,
    NodePerformanceProfile,
@@ -143,9 +143,13 @@ async def test_master():
                command=(
                    ChatCompletion(
                        command_id=CommandId(),
-                        request_params=ResponsesRequest(
+                        request_params=ChatCompletionTaskParams(
                            model="llama-3.2-1b",
-                            input="Hello, how are you?",
+                            messages=[
+                                ChatCompletionMessage(
+                                    role="user", content="Hello, how are you?"
+                                )
+                            ],
                        ),
                    )
                ),
@@ -196,9 +200,11 @@ async def test_master():
        assert isinstance(events[2].event, TaskCreated)
        assert events[2].event.task.task_status == TaskStatus.Pending
        assert isinstance(events[2].event.task, ChatCompletionTask)
-        assert events[2].event.task.task_params == ResponsesRequest(
+        assert events[2].event.task.task_params == ChatCompletionTaskParams(
            model="llama-3.2-1b",
-            input="Hello, how are you?",
+            messages=[
+                ChatCompletionMessage(role="user", content="Hello, how are you?")
+            ],
        )

        await master.shutdown()
--- a/src/exo/master/tests/test_openai_responses_api.py
+++ b/src/exo/master/tests/test_openai_responses_api.py
@@ -1,293 +0,0 @@
-"""Tests for OpenAI Responses API types.
-
-ResponsesRequest is the canonical internal type used throughout the pipeline.
-No conversion is needed for Responses API requests.
-"""
-
-import json
-from typing import Any, cast
-
-import pydantic
-import pytest
-
-from exo.shared.types.openai_responses import (
-    ResponseCompletedEvent,
-    ResponseContentPartAddedEvent,
-    ResponseCreatedEvent,
-    ResponseInputMessage,
-    ResponseMessageItem,
-    ResponseOutputItemAddedEvent,
-    ResponseOutputItemDoneEvent,
-    ResponseOutputText,
-    ResponsesRequest,
-    ResponsesResponse,
-    ResponseTextDeltaEvent,
-    ResponseTextDoneEvent,
-    ResponseUsage,
-)
-
-
-class TestResponsesRequestAsCanonicalType:
-    """Tests for ResponsesRequest as the canonical internal type."""
-
-    def test_string_input(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input="Hello, how are you?",
-        )
-
-        assert request.model == "gpt-4o"
-        assert request.input == "Hello, how are you?"
-        assert request.instructions is None
-
-    def test_message_array_input(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input=[
-                ResponseInputMessage(role="user", content="Hello"),
-                ResponseInputMessage(role="assistant", content="Hi there!"),
-                ResponseInputMessage(role="user", content="How are you?"),
-            ],
-        )
-
-        assert isinstance(request.input, list)
-        assert len(request.input) == 3
-        assert request.input[0].role == "user"
-        assert request.input[0].content == "Hello"
-        assert request.input[1].role == "assistant"
-        assert request.input[1].content == "Hi there!"
-        assert request.input[2].role == "user"
-        assert request.input[2].content == "How are you?"
-
-    def test_request_with_instructions(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input="Hello",
-            instructions="You are a helpful assistant. Be concise.",
-        )
-
-        assert request.input == "Hello"
-        assert request.instructions == "You are a helpful assistant. Be concise."
-
-    def test_request_with_optional_parameters(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input="Hello",
-            max_output_tokens=500,
-            temperature=0.8,
-            top_p=0.95,
-            stream=True,
-        )
-
-        assert request.max_output_tokens == 500
-        assert request.temperature == 0.8
-        assert request.top_p == 0.95
-        assert request.stream is True
-
-    def test_request_with_new_fields(self):
-        """Test the additional fields added for internal use."""
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input="Hello",
-            top_k=40,
-            seed=42,
-            stop=["STOP", "END"],
-            tools=[{"type": "function", "function": {"name": "test"}}],
-        )
-
-        assert request.top_k == 40
-        assert request.seed == 42
-        assert request.stop == ["STOP", "END"]
-        assert request.tools == [{"type": "function", "function": {"name": "test"}}]
-
-    def test_request_with_system_role_in_messages(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input=[
-                ResponseInputMessage(role="system", content="Be helpful"),
-                ResponseInputMessage(role="user", content="Hello"),
-            ],
-        )
-
-        assert isinstance(request.input, list)
-        assert len(request.input) == 2
-        assert request.input[0].role == "system"
-        assert request.input[1].role == "user"
-
-    def test_request_with_developer_role(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input=[
-                ResponseInputMessage(role="developer", content="Internal note"),
-                ResponseInputMessage(role="user", content="Hello"),
-            ],
-        )
-
-        assert isinstance(request.input, list)
-        assert len(request.input) == 2
-        assert request.input[0].role == "developer"
-
-
-class TestResponsesRequestValidation:
-    """Tests for OpenAI Responses API request validation."""
-
-    def test_request_requires_model(self):
-        with pytest.raises(pydantic.ValidationError):
-            ResponsesRequest.model_validate(
-                {
-                    "input": "Hello",
-                }
-            )
-
-    def test_request_requires_input(self):
-        with pytest.raises(pydantic.ValidationError):
-            ResponsesRequest.model_validate(
-                {
-                    "model": "gpt-4o",
-                }
-            )
-
-    def test_request_accepts_string_input(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input="Hello",
-        )
-        assert request.input == "Hello"
-
-    def test_request_accepts_message_array_input(self):
-        request = ResponsesRequest(
-            model="gpt-4o",
-            input=[ResponseInputMessage(role="user", content="Hello")],
-        )
-        assert len(request.input) == 1
-
-
-class TestResponsesStreamingEvents:
-    """Tests for OpenAI Responses API streaming event serialization."""
-
-    def test_response_created_event_format(self):
-        response = ResponsesResponse(
-            id="resp_123",
-            model="gpt-4o",
-            status="in_progress",
-            output=[],
-            output_text="",
-        )
-        event = ResponseCreatedEvent(response=response)
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "response.created"
-        assert parsed["response"]["id"] == "resp_123"
-        assert parsed["response"]["object"] == "response"
-        assert parsed["response"]["status"] == "in_progress"
-
-    def test_output_item_added_event_format(self):
-        item = ResponseMessageItem(
-            id="item_123",
-            content=[ResponseOutputText(text="")],
-            status="in_progress",
-        )
-        event = ResponseOutputItemAddedEvent(output_index=0, item=item)
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "response.output_item.added"
-        assert parsed["output_index"] == 0
-        assert parsed["item"]["type"] == "message"
-        assert parsed["item"]["id"] == "item_123"
-        assert parsed["item"]["role"] == "assistant"
-
-    def test_content_part_added_event_format(self):
-        part = ResponseOutputText(text="")
-        event = ResponseContentPartAddedEvent(
-            output_index=0,
-            content_index=0,
-            part=part,
-        )
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "response.content_part.added"
-        assert parsed["output_index"] == 0
-        assert parsed["content_index"] == 0
-        assert parsed["part"]["type"] == "output_text"
-
-    def test_text_delta_event_format(self):
-        event = ResponseTextDeltaEvent(
-            output_index=0,
-            content_index=0,
-            delta="Hello",
-        )
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "response.output_text.delta"
-        assert parsed["output_index"] == 0
-        assert parsed["content_index"] == 0
-        assert parsed["delta"] == "Hello"
-
-    def test_text_done_event_format(self):
-        event = ResponseTextDoneEvent(
-            output_index=0,
-            content_index=0,
-            text="Hello, world!",
-        )
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "response.output_text.done"
-        assert parsed["text"] == "Hello, world!"
-
-    def test_output_item_done_event_format(self):
-        item = ResponseMessageItem(
-            id="item_123",
-            content=[ResponseOutputText(text="Hello, world!")],
-            status="completed",
-        )
-        event = ResponseOutputItemDoneEvent(output_index=0, item=item)
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "response.output_item.done"
-        assert parsed["item"]["status"] == "completed"
-        assert parsed["item"]["content"][0]["text"] == "Hello, world!"
-
-    def test_response_completed_event_format(self):
-        item = ResponseMessageItem(
-            id="item_123",
-            content=[ResponseOutputText(text="Hello!")],
-            status="completed",
-        )
-        response = ResponsesResponse(
-            id="resp_123",
-            model="gpt-4o",
-            status="completed",
-            output=[item],
-            output_text="Hello!",
-            usage=ResponseUsage(input_tokens=10, output_tokens=5, total_tokens=15),
-        )
-        event = ResponseCompletedEvent(response=response)
-        json_str = event.model_dump_json()
-        parsed = cast(dict[str, Any], json.loads(json_str))
-
-        assert parsed["type"] == "response.completed"
-        assert parsed["response"]["status"] == "completed"
-        assert parsed["response"]["output_text"] == "Hello!"
-        assert parsed["response"]["usage"]["total_tokens"] == 15
-
-    def test_sse_format(self):
-        """Test that SSE format is correctly generated."""
-        event = ResponseTextDeltaEvent(
-            output_index=0,
-            content_index=0,
-            delta="Hello",
-        )
-        # Simulate the SSE format used in the streaming generator
-        sse_line = (
-            f"event: response.output_text.delta\ndata: {event.model_dump_json()}\n\n"
-        )
-
-        assert sse_line.startswith("event: response.output_text.delta\n")
-        assert "data: " in sse_line
-        assert sse_line.endswith("\n\n")
--- a/src/exo/shared/logging.py
+++ b/src/exo/shared/logging.py
@@ -29,11 +29,6 @@ class _InterceptHandler(logging.Handler):

 def logger_setup(log_file: Path | None, verbosity: int = 0):
    """Set up logging for this process - formatting, file handles, verbosity and output"""
-
-    logging.getLogger("exo_pyo3_bindings").setLevel(logging.WARNING)
-    logging.getLogger("httpx").setLevel(logging.WARNING)
-    logging.getLogger("httpcore").setLevel(logging.WARNING)
-
    logger.remove()

    # replace all stdlib loggers with _InterceptHandlers that log to loguru
--- a/src/exo/shared/models/model_cards.py
+++ b/src/exo/shared/models/model_cards.py
@@ -14,6 +14,32 @@ class ModelCard(CamelCaseModel):

 MODEL_CARDS: dict[str, ModelCard] = {
    # deepseek v3
+    # "deepseek-v3-0324:4bit": ModelCard(
+    #     short_id="deepseek-v3-0324:4bit",
+    #     model_id="mlx-community/DeepSeek-V3-0324-4bit",
+    #     name="DeepSeek V3 0324 (4-bit)",
+    #     description="""DeepSeek V3 is a large language model trained on the DeepSeek V3 dataset.""",
+    #     tags=[],
+    #     metadata=ModelMetadata(
+    #         model_id=ModelId("mlx-community/DeepSeek-V3-0324-4bit"),
+    #         pretty_name="DeepSeek V3 0324 (4-bit)",
+    #         storage_size=Memory.from_kb(409706307),
+    #         n_layers=61,
+    #     ),
+    # ),
+    # "deepseek-v3-0324": ModelCard(
+    #     short_id="deepseek-v3-0324",
+    #     model_id="mlx-community/DeepSeek-v3-0324-8bit",
+    #     name="DeepSeek V3 0324 (8-bit)",
+    #     description="""DeepSeek V3 is a large language model trained on the DeepSeek V3 dataset.""",
+    #     tags=[],
+    #     metadata=ModelMetadata(
+    #         model_id=ModelId("mlx-community/DeepSeek-v3-0324-8bit"),
+    #         pretty_name="DeepSeek V3 0324 (8-bit)",
+    #         storage_size=Memory.from_kb(754706307),
+    #         n_layers=61,
+    #     ),
+    # ),
    "deepseek-v3.1-4bit": ModelCard(
        short_id="deepseek-v3.1-4bit",
        model_id=ModelId("mlx-community/DeepSeek-V3.1-4bit"),
@@ -44,6 +70,63 @@ MODEL_CARDS: dict[str, ModelCard] = {
            supports_tensor=True,
        ),
    ),
+    # "deepseek-v3.2": ModelCard(
+    #     short_id="deepseek-v3.2",
+    #     model_id=ModelId("mlx-community/DeepSeek-V3.2-8bit"),
+    #     name="DeepSeek V3.2 (8-bit)",
+    #     description="""DeepSeek V3.2 is a large language model trained on the DeepSeek V3.2 dataset.""",
+    #     tags=[],
+    #     metadata=ModelMetadata(
+    #         model_id=ModelId("mlx-community/DeepSeek-V3.2-8bit"),
+    #         pretty_name="DeepSeek V3.2 (8-bit)",
+    #         storage_size=Memory.from_kb(754706307),
+    #         n_layers=61,
+    #         hidden_size=7168,
+    #     ),
+    # ),
+    # "deepseek-v3.2-4bit": ModelCard(
+    #     short_id="deepseek-v3.2-4bit",
+    #     model_id=ModelId("mlx-community/DeepSeek-V3.2-4bit"),
+    #     name="DeepSeek V3.2 (4-bit)",
+    #     description="""DeepSeek V3.2 is a large language model trained on the DeepSeek V3.2 dataset.""",
+    #     tags=[],
+    #     metadata=ModelMetadata(
+    #         model_id=ModelId("mlx-community/DeepSeek-V3.2-4bit"),
+    #         pretty_name="DeepSeek V3.2 (4-bit)",
+    #         storage_size=Memory.from_kb(754706307 // 2),  # TODO: placeholder (half the 8-bit size); replace with the real size
+    #         n_layers=61,
+    #         hidden_size=7168,
+    #     ),
+    # ),
+    # deepseek r1
+    # "deepseek-r1-0528-4bit": ModelCard(
+    #     short_id="deepseek-r1-0528-4bit",
+    #     model_id="mlx-community/DeepSeek-R1-0528-4bit",
+    #     name="DeepSeek-R1-0528 (4-bit)",
+    #     description="""DeepSeek R1 is a large language model trained on the DeepSeek R1 dataset.""",
+    #     tags=[],
+    #     metadata=ModelMetadata(
+    #         model_id=ModelId("mlx-community/DeepSeek-R1-0528-4bit"),
+    #         pretty_name="DeepSeek R1 671B (4-bit)",
+    #         storage_size=Memory.from_kb(409706307),
+    #         n_layers=61,
+    #         hidden_size=7168,
+    #     ),
+    # ),
+    # "deepseek-r1-0528": ModelCard(
+    #     short_id="deepseek-r1-0528",
+    #     model_id="mlx-community/DeepSeek-R1-0528-8bit",
+    #     name="DeepSeek-R1-0528 (8-bit)",
+    #     description="""DeepSeek R1 is a large language model trained on the DeepSeek R1 dataset.""",
+    #     tags=[],
+    #     metadata=ModelMetadata(
+    #         model_id=ModelId("mlx-community/DeepSeek-R1-0528-8bit"),
+    #         pretty_name="DeepSeek R1 671B (8-bit)",
+    #         storage_size=Memory.from_bytes(754998771712),
+    #         n_layers=61,
+    #         hidden_size=7168,
+    #     ),
+    # ),
    # kimi k2
    "kimi-k2-instruct-4bit": ModelCard(
        short_id="kimi-k2-instruct-4bit",
@@ -425,24 +508,23 @@ MODEL_CARDS: dict[str, ModelCard] = {
            supports_tensor=True,
        ),
    ),
-    "gpt-oss-20b-MXFP4-Q8": ModelCard(
-        short_id="gpt-oss-20b-MXFP4-Q8",
-        model_id=ModelId("mlx-community/gpt-oss-20b-MXFP4-Q8"),
-        name="GPT-OSS 20B (MXFP4-Q8, MLX)",
-        description="""OpenAI's GPT-OSS 20B is a medium-sized MoE model for lower-latency and local or specialized use cases; this variant is a 4-bit MLX conversion for Apple Silicon.""",
+    "gpt-oss-20b-4bit": ModelCard(
+        short_id="gpt-oss-20b-4bit",
+        model_id=ModelId("mlx-community/gpt-oss-20b-MXFP4-Q4"),
+        name="GPT-OSS 20B (MXFP4-Q4, MLX)",
+        description="""OpenAI's GPT-OSS 20B is a medium-sized MoE model for lower-latency and local or specialized use cases; this MLX variant uses MXFP4 4-bit quantization.""",
        tags=[],
        metadata=ModelMetadata(
-            model_id=ModelId("mlx-community/gpt-oss-20b-MXFP4-Q8"),
-            pretty_name="GPT-OSS 20B (MXFP4-Q8, MLX)",
+            model_id=ModelId("mlx-community/gpt-oss-20b-MXFP4-Q4"),
+            pretty_name="GPT-OSS 20B (MXFP4-Q4, MLX)",
            storage_size=Memory.from_kb(11_744_051),
            n_layers=24,
            hidden_size=2880,
            supports_tensor=True,
        ),
    ),
-    # glm 4.5
+    # Needs to be quantized with group size 32 or 16 (g32/g16) to work with tensor parallel.
    "glm-4.5-air-8bit": ModelCard(
-        # Needs to be quantized g32 or g16 to work with tensor parallel
        short_id="glm-4.5-air-8bit",
        model_id=ModelId("mlx-community/GLM-4.5-Air-8bit"),
        name="GLM 4.5 Air 8bit",
@@ -472,81 +554,19 @@ MODEL_CARDS: dict[str, ModelCard] = {
            supports_tensor=True,
        ),
    ),
-    # glm 4.7
-    "glm-4.7-4bit": ModelCard(
-        short_id="glm-4.7-4bit",
-        model_id=ModelId("mlx-community/GLM-4.7-4bit"),
-        name="GLM 4.7 4bit",
-        description="GLM 4.7 4bit",
-        tags=[],
-        metadata=ModelMetadata(
-            model_id=ModelId("mlx-community/GLM-4.7-4bit"),
-            pretty_name="GLM 4.7 4bit",
-            storage_size=Memory.from_bytes(198556925568),
-            n_layers=91,
-            hidden_size=5120,
-            supports_tensor=True,
-        ),
-    ),
-    "glm-4.7-6bit": ModelCard(
-        short_id="glm-4.7-6bit",
-        model_id=ModelId("mlx-community/GLM-4.7-6bit"),
-        name="GLM 4.7 6bit",
-        description="GLM 4.7 6bit",
-        tags=[],
-        metadata=ModelMetadata(
-            model_id=ModelId("mlx-community/GLM-4.7-6bit"),
-            pretty_name="GLM 4.7 6bit",
-            storage_size=Memory.from_bytes(286737579648),
-            n_layers=91,
-            hidden_size=5120,
-            supports_tensor=True,
-        ),
-    ),
-    "glm-4.7-8bit-gs32": ModelCard(
-        short_id="glm-4.7-8bit-gs32",
-        model_id=ModelId("mlx-community/GLM-4.7-8bit-gs32"),
-        name="GLM 4.7 8bit (gs32)",
-        description="GLM 4.7 8bit (gs32)",
-        tags=[],
-        metadata=ModelMetadata(
-            model_id=ModelId("mlx-community/GLM-4.7-8bit-gs32"),
-            pretty_name="GLM 4.7 8bit (gs32)",
-            storage_size=Memory.from_bytes(396963397248),
-            n_layers=91,
-            hidden_size=5120,
-            supports_tensor=True,
-        ),
-    ),
-    # minimax-m2
-    "minimax-m2.1-8bit": ModelCard(
-        short_id="minimax-m2.1-8bit",
-        model_id=ModelId("mlx-community/MiniMax-M2.1-8bit"),
-        name="MiniMax M2.1 8bit",
-        description="MiniMax M2.1 8bit",
-        tags=[],
-        metadata=ModelMetadata(
-            model_id=ModelId("mlx-community/MiniMax-M2.1-8bit"),
-            pretty_name="MiniMax M2.1 8bit",
-            storage_size=Memory.from_bytes(242986745856),
-            n_layers=61,
-            hidden_size=3072,
-            supports_tensor=True,
-        ),
-    ),
-    "minimax-m2.1-3bit": ModelCard(
-        short_id="minimax-m2.1-3bit",
-        model_id=ModelId("mlx-community/MiniMax-M2.1-3bit"),
-        name="MiniMax M2.1 3bit",
-        description="MiniMax M2.1 3bit",
-        tags=[],
-        metadata=ModelMetadata(
-            model_id=ModelId("mlx-community/MiniMax-M2.1-3bit"),
-            pretty_name="MiniMax M2.1 3bit",
-            storage_size=Memory.from_bytes(100086644736),
-            n_layers=61,
-            hidden_size=3072,
-            supports_tensor=True,
-        ),
-    ),
+    # "devstral-2-123b-instruct-2512-8bit": ModelCard(
+    #     short_id="devstral-2-123b-instruct-2512-8bit",
+    #     model_id=ModelId("mlx-community/Devstral-2-123B-Instruct-2512-8bit"),
+    #     name="Devstral 2 123B Instruct 2512 (8-bit, MLX)",
+    #     description="""Mistral AI's Devstral 2 123B Instruct (2512) is an agentic coding model.""",
+    #     tags=[],
+    #     metadata=ModelMetadata(
+    #         model_id=ModelId("mlx-community/Devstral-2-123B-Instruct-2512-8bit"),
+    #         pretty_name="Devstral 2 123B Instruct 2512 (8-bit, MLX)",
+    #         storage_size=Memory.from_kb(133_000_000),
+    #         n_layers=88,
+    #         hidden_size=12288,
+    #         supports_tensor=True,
+    #     ),
+    # ),
 }
--- a/src/exo/shared/tests/test_apply/test_apply_node_download.py
+++ b/src/exo/shared/tests/test_apply/test_apply_node_download.py
@@ -2,7 +2,6 @@ from exo.shared.apply import apply_node_download_progress
 from exo.shared.tests.conftest import get_pipeline_shard_metadata
 from exo.shared.types.common import NodeId
 from exo.shared.types.events import NodeDownloadProgress
-from exo.shared.types.memory import Memory
 from exo.shared.types.state import State
 from exo.shared.types.worker.downloads import DownloadCompleted
 from exo.worker.tests.constants import MODEL_A_ID, MODEL_B_ID
@@ -14,7 +13,6 @@ def test_apply_node_download_progress():
    event = DownloadCompleted(
        node_id=NodeId("node-1"),
        shard_metadata=shard1,
-        total_bytes=Memory(),
    )

    new_state = apply_node_download_progress(
@@ -30,12 +28,10 @@ def test_apply_two_node_download_progress():
    event1 = DownloadCompleted(
        node_id=NodeId("node-1"),
        shard_metadata=shard1,
-        total_bytes=Memory(),
    )
    event2 = DownloadCompleted(
        node_id=NodeId("node-1"),
        shard_metadata=shard2,
-        total_bytes=Memory(),
    )
    state = State(downloads={NodeId("node-1"): [event1]})

--- a/src/exo/shared/types/api.py
+++ b/src/exo/shared/types/api.py
@@ -5,27 +5,15 @@ from pydantic import BaseModel, Field, field_validator
 from pydantic_core import PydanticUseDefault

 from exo.shared.types.common import CommandId
-from exo.shared.types.memory import Memory
 from exo.shared.types.models import ModelId, ModelMetadata
 from exo.shared.types.worker.instances import Instance, InstanceId, InstanceMeta
 from exo.shared.types.worker.shards import Sharding

 FinishReason = Literal[
-    "stop", "length", "tool_calls", "content_filter", "function_call", "error"
+    "stop", "length", "tool_calls", "content_filter", "function_call"
 ]


-class ErrorInfo(BaseModel):
-    message: str
-    type: str
-    param: str | None = None
-    code: int
-
-
-class ErrorResponse(BaseModel):
-    error: ErrorInfo
-
-
 class ModelListModel(BaseModel):
    id: str
    object: str = "model"
@@ -63,10 +51,6 @@ class ChatCompletionMessage(BaseModel):
    function_call: dict[str, Any] | None = None


-class BenchChatCompletionMessage(ChatCompletionMessage):
-    pass
-
-
 class TopLogprobItem(BaseModel):
    token: str
    logprob: float
@@ -129,18 +113,6 @@ class ChatCompletionResponse(BaseModel):
    service_tier: str | None = None


-class GenerationStats(BaseModel):
-    prompt_tps: float
-    generation_tps: float
-    prompt_tokens: int
-    generation_tokens: int
-    peak_memory_usage: Memory
-
-
-class BenchChatCompletionResponse(ChatCompletionResponse):
-    generation_stats: GenerationStats | None = None
-
-
 class ChatCompletionTaskParams(BaseModel):
    model: str
    frequency_penalty: float | None = None
@@ -157,17 +129,10 @@ class ChatCompletionTaskParams(BaseModel):
    stream: bool = False
    temperature: float | None = None
    top_p: float | None = None
-    top_k: int | None = None
    tools: list[dict[str, Any]] | None = None
    tool_choice: str | dict[str, Any] | None = None
    parallel_tool_calls: bool | None = None
    user: str | None = None
-    # When True, continue the last assistant message without EOS tokens
-    continue_from_prefix: bool = False
-
-
-class BenchChatCompletionTaskParams(ChatCompletionTaskParams):
-    pass


 class PlaceInstanceParams(BaseModel):
--- a/src/exo/shared/types/chunks.py
+++ b/src/exo/shared/types/chunks.py
@@ -1,6 +1,5 @@
 from enum import Enum

-from exo.shared.types.api import GenerationStats, TopLogprobItem
 from exo.utils.pydantic_ext import TaggedModel

 from .api import FinishReason
@@ -20,11 +19,7 @@ class BaseChunk(TaggedModel):
 class TokenChunk(BaseChunk):
    text: str
    token_id: int
-    logprob: float | None = None  # Log probability of the selected token
-    top_logprobs: list[TopLogprobItem] | None = None  # Top-k alternative tokens
    finish_reason: FinishReason | None = None
-    stats: GenerationStats | None = None
-    error_message: str | None = None


 class ImageChunk(BaseChunk):
--- a/src/exo/shared/types/claude_api.py
+++ b/src/exo/shared/types/claude_api.py
@@ -1,168 +0,0 @@
-"""Claude Messages API types for request/response conversion."""
-
-from typing import Literal
-
-from pydantic import BaseModel, Field
-
-# Type aliases
-ClaudeRole = Literal["user", "assistant"]
-ClaudeStopReason = Literal["end_turn", "max_tokens", "stop_sequence", "tool_use"]
-
-
-# Content block types
-class ClaudeTextBlock(BaseModel, frozen=True):
-    """Text content block in Claude Messages API."""
-
-    type: Literal["text"] = "text"
-    text: str
-
-
-class ClaudeImageSource(BaseModel, frozen=True):
-    """Image source for Claude image blocks."""
-
-    type: Literal["base64", "url"]
-    media_type: str | None = None
-    data: str | None = None
-    url: str | None = None
-
-
-class ClaudeImageBlock(BaseModel, frozen=True):
-    """Image content block in Claude Messages API."""
-
-    type: Literal["image"] = "image"
-    source: ClaudeImageSource
-
-
-ClaudeContentBlock = ClaudeTextBlock | ClaudeImageBlock
-
-
-# Request types
-class ClaudeMessage(BaseModel, frozen=True):
-    """Message in Claude Messages API request."""
-
-    role: ClaudeRole
-    content: str | list[ClaudeContentBlock]
-
-
-class ClaudeMessagesRequest(BaseModel):
-    """Request body for Claude Messages API."""
-
-    model: str
-    max_tokens: int
-    messages: list[ClaudeMessage]
-    system: str | list[ClaudeTextBlock] | None = None
-    stop_sequences: list[str] | None = None
-    stream: bool = False
-    temperature: float | None = None
-    top_p: float | None = None
-    top_k: int | None = None
-    metadata: dict[str, str] | None = None
-
-
-# Response types
-class ClaudeUsage(BaseModel, frozen=True):
-    """Token usage in Claude Messages API response."""
-
-    input_tokens: int
-    output_tokens: int
-
-
-class ClaudeMessagesResponse(BaseModel, frozen=True):
-    """Response body for Claude Messages API."""
-
-    id: str
-    type: Literal["message"] = "message"
-    role: Literal["assistant"] = "assistant"
-    content: list[ClaudeTextBlock]
-    model: str
-    stop_reason: ClaudeStopReason | None = None
-    stop_sequence: str | None = None
-    usage: ClaudeUsage
-
-
-# Streaming event types
-class ClaudeMessageStart(BaseModel, frozen=True):
-    """Partial message in message_start event."""
-
-    id: str
-    type: Literal["message"] = "message"
-    role: Literal["assistant"] = "assistant"
-    content: list[ClaudeTextBlock] = Field(default_factory=list)
-    model: str
-    stop_reason: ClaudeStopReason | None = None
-    stop_sequence: str | None = None
-    usage: ClaudeUsage
-
-
-class ClaudeMessageStartEvent(BaseModel, frozen=True):
-    """Event sent at start of message stream."""
-
-    type: Literal["message_start"] = "message_start"
-    message: ClaudeMessageStart
-
-
-class ClaudeContentBlockStartEvent(BaseModel, frozen=True):
-    """Event sent at start of a content block."""
-
-    type: Literal["content_block_start"] = "content_block_start"
-    index: int
-    content_block: ClaudeTextBlock
-
-
-class ClaudeTextDelta(BaseModel, frozen=True):
-    """Delta for text content block."""
-
-    type: Literal["text_delta"] = "text_delta"
-    text: str
-
-
-class ClaudeContentBlockDeltaEvent(BaseModel, frozen=True):
-    """Event sent for content block delta."""
-
-    type: Literal["content_block_delta"] = "content_block_delta"
-    index: int
-    delta: ClaudeTextDelta
-
-
-class ClaudeContentBlockStopEvent(BaseModel, frozen=True):
-    """Event sent at end of a content block."""
-
-    type: Literal["content_block_stop"] = "content_block_stop"
-    index: int
-
-
-class ClaudeMessageDeltaUsage(BaseModel, frozen=True):
-    """Usage in message_delta event."""
-
-    output_tokens: int
-
-
-class ClaudeMessageDelta(BaseModel, frozen=True):
-    """Delta in message_delta event."""
-
-    stop_reason: ClaudeStopReason | None = None
-    stop_sequence: str | None = None
-
-
-class ClaudeMessageDeltaEvent(BaseModel, frozen=True):
-    """Event sent with final message delta."""
-
-    type: Literal["message_delta"] = "message_delta"
-    delta: ClaudeMessageDelta
-    usage: ClaudeMessageDeltaUsage
-
-
-class ClaudeMessageStopEvent(BaseModel, frozen=True):
-    """Event sent at end of message stream."""
-
-    type: Literal["message_stop"] = "message_stop"
-
-
-ClaudeStreamEvent = (
-    ClaudeMessageStartEvent
-    | ClaudeContentBlockStartEvent
-    | ClaudeContentBlockDeltaEvent
-    | ClaudeContentBlockStopEvent
-    | ClaudeMessageDeltaEvent
-    | ClaudeMessageStopEvent
-)
--- a/src/exo/shared/types/commands.py
+++ b/src/exo/shared/types/commands.py
@@ -1,8 +1,8 @@
 from pydantic import Field

+from exo.shared.types.api import ChatCompletionTaskParams
 from exo.shared.types.common import CommandId, NodeId
 from exo.shared.types.models import ModelMetadata
-from exo.shared.types.openai_responses import ResponsesRequest
 from exo.shared.types.worker.instances import Instance, InstanceId, InstanceMeta
 from exo.shared.types.worker.shards import Sharding
 from exo.utils.pydantic_ext import CamelCaseModel, TaggedModel
@@ -17,7 +17,7 @@ class TestCommand(BaseCommand):


 class ChatCompletion(BaseCommand):
-    request_params: ResponsesRequest
+    request_params: ChatCompletionTaskParams


 class PlaceInstance(BaseCommand):
--- a/src/exo/shared/types/openai_responses.py
+++ b/src/exo/shared/types/openai_responses.py
@@ -1,190 +0,0 @@
-"""OpenAI Responses API types for request/response conversion.
-
-ResponsesRequest serves as both:
-1. The external API request type for /v1/responses
-2. The canonical internal type used throughout the inference pipeline
-
-All external API formats (Chat Completions, Claude) are converted to
-ResponsesRequest at the API boundary.
-"""
-
-import time
-from typing import Any, Literal
-
-from pydantic import BaseModel, Field
-
-# Type aliases
-ResponseStatus = Literal["completed", "failed", "in_progress", "incomplete"]
-ResponseRole = Literal["user", "assistant", "system", "developer"]
-
-
-# Request types
-class ResponseInputMessage(BaseModel, frozen=True):
-    """Input message for Responses API.
-
-    This is also used as the internal message format throughout the pipeline.
-    """
-
-    role: ResponseRole
-    content: str
-
-
-class ResponsesRequest(BaseModel):
-    """Request body for OpenAI Responses API.
-
-    This is also the canonical internal task params format used throughout
-    the inference pipeline. All external API formats are converted to this
-    format at the API boundary.
-
-    Field mapping from other APIs:
-    - input: Replaces 'messages' from Chat Completions
-    - instructions: System message, extracted from messages or Claude's 'system'
-    - max_output_tokens: Replaces 'max_tokens' from Chat Completions
-    """
-
-    model: str
-    input: str | list[ResponseInputMessage]
-    instructions: str | None = None
-    max_output_tokens: int | None = None
-    temperature: float | None = None
-    top_p: float | None = None
-    top_k: int | None = None
-    stop: str | list[str] | None = None
-    seed: int | None = None
-    stream: bool = False
-    # Tools support
-    tools: list[dict[str, Any]] | None = None
-    # previous_response_id not supported in MVP
-    metadata: dict[str, str] | None = None
-    # When True, continue the last assistant message without EOS tokens
-    continue_from_prefix: bool = False
-
-
-# Response types
-class ResponseOutputText(BaseModel, frozen=True):
-    """Text content in response output."""
-
-    type: Literal["output_text"] = "output_text"
-    text: str
-    annotations: list[dict[str, str]] = Field(default_factory=list)
-
-
-class ResponseMessageItem(BaseModel, frozen=True):
-    """Message item in response output array."""
-
-    type: Literal["message"] = "message"
-    id: str
-    role: Literal["assistant"] = "assistant"
-    content: list[ResponseOutputText]
-    status: ResponseStatus = "completed"
-
-
-ResponseItem = ResponseMessageItem  # Can expand for function_call, reasoning, etc.
-
-
-class ResponseUsage(BaseModel, frozen=True):
-    """Token usage in Responses API response."""
-
-    input_tokens: int
-    output_tokens: int
-    total_tokens: int
-
-
-class ResponsesResponse(BaseModel, frozen=True):
-    """Response body for OpenAI Responses API."""
-
-    id: str
-    object: Literal["response"] = "response"
-    created_at: int = Field(default_factory=lambda: int(time.time()))
-    status: ResponseStatus = "completed"
-    model: str
-    output: list[ResponseItem]
-    output_text: str
-    usage: ResponseUsage | None = None
-
-
-# Streaming event types
-class ResponseCreatedEvent(BaseModel, frozen=True):
-    """Event sent when response is created."""
-
-    type: Literal["response.created"] = "response.created"
-    response: ResponsesResponse
-
-
-class ResponseInProgressEvent(BaseModel, frozen=True):
-    """Event sent when response starts processing."""
-
-    type: Literal["response.in_progress"] = "response.in_progress"
-    response: ResponsesResponse
-
-
-class ResponseOutputItemAddedEvent(BaseModel, frozen=True):
-    """Event sent when an output item is added."""
-
-    type: Literal["response.output_item.added"] = "response.output_item.added"
-    output_index: int
-    item: ResponseItem
-
-
-class ResponseContentPartAddedEvent(BaseModel, frozen=True):
-    """Event sent when a content part is added."""
-
-    type: Literal["response.content_part.added"] = "response.content_part.added"
-    output_index: int
-    content_index: int
-    part: ResponseOutputText
-
-
-class ResponseTextDeltaEvent(BaseModel, frozen=True):
-    """Event sent for text delta during streaming."""
-
-    type: Literal["response.output_text.delta"] = "response.output_text.delta"
-    output_index: int
-    content_index: int
-    delta: str
-
-
-class ResponseTextDoneEvent(BaseModel, frozen=True):
-    """Event sent when text content is done."""
-
-    type: Literal["response.output_text.done"] = "response.output_text.done"
-    output_index: int
-    content_index: int
-    text: str
-
-
-class ResponseContentPartDoneEvent(BaseModel, frozen=True):
-    """Event sent when a content part is done."""
-
-    type: Literal["response.content_part.done"] = "response.content_part.done"
-    output_index: int
-    content_index: int
-    part: ResponseOutputText
-
-
-class ResponseOutputItemDoneEvent(BaseModel, frozen=True):
-    """Event sent when an output item is done."""
-
-    type: Literal["response.output_item.done"] = "response.output_item.done"
-    output_index: int
-    item: ResponseItem
-
-
-class ResponseCompletedEvent(BaseModel, frozen=True):
-    """Event sent when response is completed."""
-
-    type: Literal["response.completed"] = "response.completed"
-    response: ResponsesResponse
-
-
-ResponsesStreamEvent = (
-    ResponseCreatedEvent
-    | ResponseInProgressEvent
-    | ResponseOutputItemAddedEvent
-    | ResponseContentPartAddedEvent
-    | ResponseTextDeltaEvent
-    | ResponseTextDoneEvent
-    | ResponseContentPartDoneEvent
-    | ResponseOutputItemDoneEvent
-    | ResponseCompletedEvent
-)
--- a/src/exo/shared/types/tasks.py
+++ b/src/exo/shared/types/tasks.py
@@ -2,8 +2,8 @@ from enum import Enum

 from pydantic import Field

+from exo.shared.types.api import ChatCompletionTaskParams
 from exo.shared.types.common import CommandId, Id
-from exo.shared.types.openai_responses import ResponsesRequest
 from exo.shared.types.worker.instances import BoundInstance, InstanceId
 from exo.shared.types.worker.runners import RunnerId
 from exo.shared.types.worker.shards import ShardMetadata
@@ -50,7 +50,7 @@ class StartWarmup(BaseTask):  # emitted by Worker

 class ChatCompletion(BaseTask):  # emitted by Master
    command_id: CommandId
-    task_params: ResponsesRequest
+    task_params: ChatCompletionTaskParams

    error_type: str | None = Field(default=None)
    error_message: str | None = Field(default=None)
--- a/src/exo/shared/types/worker/downloads.py
+++ b/src/exo/shared/types/worker/downloads.py
@@ -28,7 +28,7 @@ class DownloadPending(BaseDownloadProgress):


 class DownloadCompleted(BaseDownloadProgress):
-    total_bytes: Memory
+    pass


 class DownloadFailed(BaseDownloadProgress):
--- a/src/exo/shared/types/worker/runner_response.py
+++ b/src/exo/shared/types/worker/runner_response.py
@@ -1,4 +1,4 @@
-from exo.shared.types.api import FinishReason, GenerationStats, TopLogprobItem
+from exo.shared.types.api import FinishReason
 from exo.utils.pydantic_ext import TaggedModel


@@ -13,10 +13,8 @@ class TokenizedResponse(BaseRunnerResponse):
 class GenerationResponse(BaseRunnerResponse):
    text: str
    token: int
-    logprob: float | None = None  # Log probability of the selected token
-    top_logprobs: list[TopLogprobItem] | None = None  # Top-k alternative tokens
+    # logprobs: list[float] | None = None  # too large; could be reduced to top-k only
    finish_reason: FinishReason | None = None
-    stats: GenerationStats | None = None


 class FinishedResponse(BaseRunnerResponse):
--- a/src/exo/shared/types/worker/runners.py
+++ b/src/exo/shared/types/worker/runners.py
@@ -53,10 +53,6 @@ class RunnerRunning(BaseRunnerStatus):
    pass


-class RunnerShuttingDown(BaseRunnerStatus):
-    pass
-
-
 class RunnerShutdown(BaseRunnerStatus):
    pass

@@ -74,7 +70,6 @@ RunnerStatus = (
    | RunnerWarmingUp
    | RunnerReady
    | RunnerRunning
-    | RunnerShuttingDown
    | RunnerShutdown
    | RunnerFailed
 )
--- a/src/exo/worker/engines/mlx/init.py
+++ b/src/exo/worker/engines/mlx/init.py
@@ -40,6 +40,4 @@ class TokenizerWrapper:
        messages_dicts: list[dict[str, Any]],
        tokenize: bool = False,
        add_generation_prompt: bool = True,
-        continue_final_message: bool = False,
-        tools: list[dict[str, Any]] | None = None,
    ) -> str: ...
--- a/src/exo/worker/engines/mlx/auto_parallel.py
+++ b/src/exo/worker/engines/mlx/auto_parallel.py
@@ -10,24 +10,18 @@ from mlx.nn.layers.distributed import (
    shard_linear,
    sum_gradients,
 )
+from mlx_lm.models.cache import (
+    _BaseCache,  # pyright: ignore[reportPrivateUsage]
+)
 from mlx_lm.models.deepseek_v3 import DeepseekV3MLP
 from mlx_lm.models.deepseek_v3 import Model as DeepseekV3Model
-from mlx_lm.models.deepseek_v32 import DeepseekV32MLP
-from mlx_lm.models.deepseek_v32 import Model as DeepseekV32Model
-from mlx_lm.models.glm4_moe import Model as Glm4MoeModel
-from mlx_lm.models.glm4_moe import MoE
-from mlx_lm.models.gpt_oss import GptOssMoeModel
-from mlx_lm.models.gpt_oss import Model as GptOssModel
 from mlx_lm.models.llama import Model as LlamaModel
-from mlx_lm.models.minimax import Model as MiniMaxModel
-from mlx_lm.models.ministral3 import Model as Ministral3Model
 from mlx_lm.models.qwen3_moe import Model as Qwen3MoeModel
 from mlx_lm.models.qwen3_moe import Qwen3MoeSparseMoeBlock
-from mlx_lm.models.qwen3_next import Model as Qwen3NextModel
-from mlx_lm.models.qwen3_next import Qwen3NextSparseMoeBlock

-from exo.shared.logging import logger
-from exo.shared.types.worker.shards import PipelineShardMetadata
+from exo.shared.types.worker.shards import (
+    PipelineShardMetadata,
+)


 class _LayerCallable(Protocol):
@@ -97,6 +91,8 @@ class PipelineLastLayer(CustomMlxLayer):
            x, *args, **kwargs
        ).arguments.get("cache", None)

+        assert cache is None or issubclass(type(cache), _BaseCache)  # type: ignore
+
        output: mx.array = self.original_layer(x, *args, **kwargs)

        if self.r != self.s - 1:
@@ -104,6 +100,7 @@ class PipelineLastLayer(CustomMlxLayer):
                output, (self.r + 1) % self.s, group=self.group
            )
            if cache is not None:
+                # Mirrors an upstream mlx change (TODO: link the relevant mlx GitHub commit)
                cache.keys = mx.depends(cache.keys, output)  # type: ignore[reportUnknownMemberType]

        output = mx.distributed.all_gather(output, group=self.group)[-output.shape[0] :]
@@ -135,6 +132,24 @@ def _get_layers(inner_model_instance: nn.Module) -> list[_LayerCallable]:
    return layers


+def _set_layers(model: nn.Module, layers: list[_LayerCallable]) -> None:
+    inner_model_instance = _inner_model(model)
+    if hasattr(inner_model_instance, "layers"):
+        inner_model_instance.layers = layers
+
+        # Update DeepSeek V3 specific parameters when layers are shrunk
+        if isinstance(model, DeepseekV3Model) and hasattr(
+            inner_model_instance, "num_layers"
+        ):
+            inner_model_instance.start_idx = 0
+            inner_model_instance.end_idx = len(layers)
+            inner_model_instance.num_layers = len(layers)
+    elif hasattr(inner_model_instance, "h"):
+        inner_model_instance.h = layers
+    else:
+        raise ValueError("Model must have either a 'layers' or 'h' attribute")
+
+
 def pipeline_auto_parallel(
    model: nn.Module,
    group: mx.distributed.Group,
@@ -150,7 +165,8 @@ def pipeline_auto_parallel(
    """
    inner_model_instance: nn.Module = _inner_model(model)

-    layers = _get_layers(inner_model_instance)
+    # Handle both model.layers and model.h cases
+    layers: list[_LayerCallable] = _get_layers(inner_model_instance)

    start_layer, end_layer = model_shard_meta.start_layer, model_shard_meta.end_layer
    device_rank, world_size = model_shard_meta.device_rank, model_shard_meta.world_size
@@ -164,17 +180,6 @@ def pipeline_auto_parallel(
        group=group,
    )

-    if isinstance(inner_model_instance, GptOssMoeModel):
-        inner_model_instance.layer_types = inner_model_instance.layer_types[  # type: ignore
-            start_layer:end_layer
-        ]
-        inner_model_instance.swa_idx = inner_model_instance.layer_types.index(  # type: ignore
-            "sliding_attention"
-        )
-        inner_model_instance.ga_idx = inner_model_instance.layer_types.index(  # type: ignore
-            "full_attention"
-        )
-
    _set_layers(model, layers)

    assert isinstance(layers, list), (
@@ -199,44 +204,18 @@ def tensor_auto_parallel(
        group=group,
    )

-    segments: int = 1
-
-    def _all_to_sharded(path: str, weight: mx.array):
-        if path.endswith("bias"):
-            logger.info(f"Sharding bias for {path} - all to sharded")
-            return weight.ndim - 1, segments
-        return max(weight.ndim - 2, 0), segments
-
    all_to_sharded_linear_in_place = partial(
        shard_inplace,
-        sharding=_all_to_sharded,  # type: ignore
+        sharding="all-to-sharded",
        group=group,
    )
-
-    n = group.size()
-
-    def _sharded_to_all(path: str, weight: mx.array):
-        if path.endswith("bias"):
-            logger.info(f"Sharding bias for {path} - sharded to all")
-            weight /= n
-            return None
-        return -1, segments
-
    sharded_to_all_linear_in_place = partial(
        shard_inplace,
-        sharding=_sharded_to_all,  # type: ignore
+        sharding="sharded-to-all",
        group=group,
    )

-    if hasattr(model, "shard"):
-        try:
-            model.shard(group)  # type: ignore
-            return model
-        except (AttributeError, TypeError, NameError):
-            pass
-
-    if isinstance(model, (LlamaModel, Ministral3Model)):
-        logger.warning("shouldn't be hit - upstream sharding exists")
+    if isinstance(model, LlamaModel):
        tensor_parallel_sharding_strategy = LlamaShardingStrategy(
            group,
            all_to_sharded_linear,
@@ -244,8 +223,7 @@ def tensor_auto_parallel(
            all_to_sharded_linear_in_place,
            sharded_to_all_linear_in_place,
        )
-    elif isinstance(model, (DeepseekV3Model, DeepseekV32Model)):
-        logger.warning("shouldn't be hit - upstream sharding exists")
+    elif isinstance(model, DeepseekV3Model):
        tensor_parallel_sharding_strategy = DeepSeekShardingStrategy(
            group,
            all_to_sharded_linear,
@@ -253,15 +231,7 @@ def tensor_auto_parallel(
            all_to_sharded_linear_in_place,
            sharded_to_all_linear_in_place,
        )
-    elif isinstance(model, MiniMaxModel):
-        tensor_parallel_sharding_strategy = MiniMaxShardingStrategy(
-            group,
-            all_to_sharded_linear,
-            sharded_to_all_linear,
-            all_to_sharded_linear_in_place,
-            sharded_to_all_linear_in_place,
-        )
-    elif isinstance(model, (Qwen3MoeModel, Glm4MoeModel, Qwen3NextModel)):
+    elif isinstance(model, Qwen3MoeModel):
        tensor_parallel_sharding_strategy = QwenShardingStrategy(
            group,
            all_to_sharded_linear,
@@ -269,15 +239,6 @@ def tensor_auto_parallel(
            all_to_sharded_linear_in_place,
            sharded_to_all_linear_in_place,
        )
-    elif isinstance(model, GptOssModel):
-        tensor_parallel_sharding_strategy = GptOssShardingStrategy(
-            group,
-            all_to_sharded_linear,
-            sharded_to_all_linear,
-            all_to_sharded_linear_in_place,
-            sharded_to_all_linear_in_place,
-        )
-
    else:
        raise ValueError(f"Unsupported model type: {type(model)}")

@@ -323,38 +284,13 @@ class LlamaShardingStrategy(TensorParallelShardingStrategy):
        return model


-def _set_layers(model: nn.Module, layers: list[_LayerCallable]) -> None:
-    inner_model_instance = _inner_model(model)
-    if hasattr(inner_model_instance, "layers"):
-        inner_model_instance.layers = layers
-
-        # Update DeepSeek V3 specific parameters when layers are shrunk
-        if isinstance(
-            model, (DeepseekV3Model, DeepseekV32Model, Glm4MoeModel)
-        ) and hasattr(inner_model_instance, "num_layers"):
-            logger.info(
-                f"Setting num_layers to {len(layers)} for model {model.model.__class__.__name__}"
-            )
-            inner_model_instance.start_idx = 0
-            inner_model_instance.end_idx = len(layers)
-            inner_model_instance.num_layers = len(layers)
-        elif isinstance(model, Qwen3MoeModel):
-            logger.info(
-                f"Setting num_hidden_layers to {len(layers)} for model {model.model.__class__.__name__}"
-            )
-            inner_model_instance.num_hidden_layers = len(layers)
-    elif hasattr(inner_model_instance, "h"):
-        inner_model_instance.h = layers
-    else:
-        raise ValueError("Model must have either a 'layers' or 'h' attribute")
-
-
 class DeepSeekShardingStrategy(TensorParallelShardingStrategy):
    def shard_model(self, model: nn.Module) -> nn.Module:
        model = cast(DeepseekV3Model, model)
        for layer in model.layers:
            # Shard the self attention
-            if layer.self_attn.q_lora_rank is None:
+            if layer.self_attn.q_lora_rank is None:  # pyright: ignore[reportUnnecessaryComparison]
+                # Unfortunately, q_lora_rank can be None despite typing hints.
                layer.self_attn.q_proj = self.all_to_sharded_linear(
                    layer.self_attn.q_proj
                )
@@ -369,7 +305,7 @@ class DeepSeekShardingStrategy(TensorParallelShardingStrategy):
            layer.self_attn.num_heads //= self.N

            # Shard the MLP
-            if isinstance(layer.mlp, (DeepseekV3MLP, DeepseekV32MLP)):
+            if isinstance(layer.mlp, DeepseekV3MLP):
                layer.mlp.gate_proj = self.all_to_sharded_linear(layer.mlp.gate_proj)
                layer.mlp.down_proj = self.sharded_to_all_linear(layer.mlp.down_proj)
                layer.mlp.up_proj = self.all_to_sharded_linear(layer.mlp.up_proj)
@@ -403,35 +339,6 @@ class ShardedDeepseekV3MoE(CustomMlxLayer):
        return y


-class MiniMaxShardingStrategy(TensorParallelShardingStrategy):
-    def shard_model(self, model: nn.Module) -> nn.Module:
-        model = cast(MiniMaxModel, model)
-        for layer in model.layers:
-            # Shard the self attention
-            layer.self_attn.q_proj = self.all_to_sharded_linear(layer.self_attn.q_proj)
-            layer.self_attn.k_proj = self.all_to_sharded_linear(layer.self_attn.k_proj)
-            layer.self_attn.v_proj = self.all_to_sharded_linear(layer.self_attn.v_proj)
-            layer.self_attn.o_proj = self.sharded_to_all_linear(layer.self_attn.o_proj)
-            layer.self_attn.num_attention_heads //= self.N
-            layer.self_attn.num_key_value_heads //= self.N
-
-            # Shard the MoE. Shard in place since the MoE should be responsible
-            # for aggregating the results.
-            self.all_to_sharded_linear_in_place(
-                layer.block_sparse_moe.switch_mlp.gate_proj
-            )
-            self.sharded_to_all_linear_in_place(
-                layer.block_sparse_moe.switch_mlp.down_proj
-            )
-            self.all_to_sharded_linear_in_place(
-                layer.block_sparse_moe.switch_mlp.up_proj
-            )
-            layer.block_sparse_moe = ShardedQwenMoE(layer.block_sparse_moe)  # pyright: ignore[reportAttributeAccessIssue, reportArgumentType]
-            layer.block_sparse_moe.sharding_group = self.group
-
-        return model
-
-
 class QwenShardingStrategy(TensorParallelShardingStrategy):
    def shard_model(self, model: nn.Module) -> nn.Module:
        model = cast(Qwen3MoeModel, model)
@@ -446,13 +353,11 @@ class QwenShardingStrategy(TensorParallelShardingStrategy):

            # Shard the MoE. Shard in place since the MoE should be responsible
            # for aggregating the results.
-            if isinstance(
-                layer.mlp, (Qwen3MoeSparseMoeBlock, MoE, Qwen3NextSparseMoeBlock)
-            ):
+            if isinstance(layer.mlp, Qwen3MoeSparseMoeBlock):
                self.all_to_sharded_linear_in_place(layer.mlp.switch_mlp.gate_proj)
                self.sharded_to_all_linear_in_place(layer.mlp.switch_mlp.down_proj)
                self.all_to_sharded_linear_in_place(layer.mlp.switch_mlp.up_proj)
-                layer.mlp = ShardedQwenMoE(layer.mlp)  # pyright: ignore[reportAttributeAccessIssue, reportArgumentType]
+                layer.mlp = ShardedQwenMoE(layer.mlp)  # type: ignore
                layer.mlp.sharding_group = self.group

            # Shard the MLP
@@ -476,50 +381,3 @@ class ShardedQwenMoE(CustomMlxLayer):
        if self.sharding_group is not None:
            y = mx.distributed.all_sum(y, group=self.sharding_group)
        return y
-
-
-class GptOssShardingStrategy(TensorParallelShardingStrategy):
-    def shard_model(self, model: nn.Module) -> nn.Module:
-        model = cast(GptOssMoeModel, model)
-
-        for layer in model.layers:
-            layer.self_attn.q_proj = self.all_to_sharded_linear(layer.self_attn.q_proj)
-            layer.self_attn.k_proj = self.all_to_sharded_linear(layer.self_attn.k_proj)
-            layer.self_attn.v_proj = self.all_to_sharded_linear(layer.self_attn.v_proj)
-            layer.self_attn.o_proj = self.sharded_to_all_linear(layer.self_attn.o_proj)
-
-            layer.self_attn.num_attention_heads //= self.N
-            layer.self_attn.num_key_value_heads //= self.N
-            layer.self_attn.num_key_value_groups = (
-                layer.self_attn.num_attention_heads
-                // layer.self_attn.num_key_value_heads
-            )
-
-            layer.self_attn.sinks = layer.self_attn.sinks[
-                layer.self_attn.num_attention_heads
-                * self.group.rank() : layer.self_attn.num_attention_heads
-                * (self.group.rank() + 1)
-            ]
-
-            self.all_to_sharded_linear_in_place(layer.mlp.experts.gate_proj)
-            self.sharded_to_all_linear_in_place(layer.mlp.experts.down_proj)
-            self.all_to_sharded_linear_in_place(layer.mlp.experts.up_proj)
-
-            layer.mlp = ShardedGptOssMoE(layer.mlp)  # type: ignore
-            layer.mlp.sharding_group = self.group
-
-        return model
-
-
-class ShardedGptOssMoE(CustomMlxLayer):
-    def __init__(self, layer: nn.Module):
-        super().__init__(layer)
-        self.sharding_group: mx.distributed.Group | None = None
-
-    def __call__(self, x: mx.array) -> mx.array:
-        if self.sharding_group is not None:
-            x = sum_gradients(self.sharding_group)(x)
-        y = self.original_layer(x)
-        if self.sharding_group is not None:
-            y = mx.distributed.all_sum(y, group=self.sharding_group)
-        return y
--- a/src/exo/worker/engines/mlx/generator/generate.py
+++ b/src/exo/worker/engines/mlx/generator/generate.py
@@ -3,17 +3,11 @@ from typing import Any, Callable, Generator, cast, get_args
 import mlx.core as mx
 from mlx_lm import stream_generate
 from mlx_lm.models.cache import KVCache
-from mlx_lm.sample_utils import make_sampler
 from mlx_lm.tokenizer_utils import TokenizerWrapper

 # from exo.engines.mlx.cache import KVPrefixCache
-from exo.shared.types.api import (
-    FinishReason,
-    GenerationStats,
-    TopLogprobItem,
-)
-from exo.shared.types.memory import Memory
-from exo.shared.types.openai_responses import ResponsesRequest
+from exo.shared.types.api import ChatCompletionMessage, FinishReason
+from exo.shared.types.tasks import ChatCompletionTaskParams
 from exo.shared.types.worker.runner_response import (
    GenerationResponse,
 )
@@ -47,14 +41,20 @@ def maybe_quantize_kv_cache(
 def warmup_inference(
    model: Model,
    tokenizer: TokenizerWrapper,
+    sampler: Callable[[mx.array], mx.array],
 ) -> int:
    content = "Prompt to warm up the inference engine. Repeat this."

    warmup_prompt = apply_chat_template(
        tokenizer=tokenizer,
-        task_params=ResponsesRequest(
+        chat_task_data=ChatCompletionTaskParams(
            model="",
-            input=content,
+            messages=[
+                ChatCompletionMessage(
+                    role="user",
+                    content=content,
+                )
+            ],
        ),
    )

@@ -64,9 +64,6 @@ def warmup_inference(
        model=model,
    )

-    # Use a default sampler for warmup
-    sampler = make_sampler(temp=0.7)
-
    logger.info("Generating warmup tokens")
    for _r in stream_generate(
        model=model,
@@ -75,7 +72,7 @@ def warmup_inference(
        max_tokens=50,
        sampler=sampler,
        prompt_cache=cache,
-        prefill_step_size=2048,
+        prefill_step_size=65536,
        kv_group_size=KV_GROUP_SIZE,
        kv_bits=KV_BITS,
    ):
@@ -83,159 +80,54 @@ def warmup_inference(
        tokens_generated += 1

    logger.info("Generated ALL warmup tokens")
-
-    # TODO: Do we want an mx_barrier?
-    #  At least this version is actively incorrect, as it should use mx_barrier(group)
    mx_barrier()

    return tokens_generated


-def ban_token_ids(token_ids: list[int]) -> Callable[[mx.array, mx.array], mx.array]:
-    token_ids = [int(t) for t in token_ids]
-
-    def proc(_history: mx.array, logits: mx.array) -> mx.array:
-        for tid in token_ids:
-            logits[..., tid] = -1e9
-        return logits
-
-    return proc
-
-
-def eos_ids_from_tokenizer(tokenizer: TokenizerWrapper) -> list[int]:
-    eos: list[int] | None = getattr(tokenizer, "eos_token_ids", None)
-    if eos is None:
-        return []
-    return eos
-
-
 def mlx_generate(
    model: Model,
    tokenizer: TokenizerWrapper,
-    task: ResponsesRequest,
-    is_bench: bool = False,
+    sampler: Callable[[mx.array], mx.array],
+    task: ChatCompletionTaskParams,
 ) -> Generator[GenerationResponse]:
-    # Ensure that generation stats only contains peak memory for this generation
-    mx.reset_peak_memory()
-
    # Currently we support chat-completion tasks only.
    logger.info(f"task_params: {task}")

-    if task.seed is not None:
-        mx.random.seed(task.seed)
-
    prompt = apply_chat_template(
        tokenizer=tokenizer,
-        task_params=task,
+        chat_task_data=task,
    )

    caches = make_kv_cache(model=model)

-    logits_processors: list[Callable[[mx.array, mx.array], mx.array]] = []
-    if is_bench:
-        # Only sample length eos tokens
-        eos_ids = eos_ids_from_tokenizer(tokenizer)
-        logits_processors = [ban_token_ids(eos_ids)]
-
-    sampler = make_sampler(
-        temp=task.temperature if task.temperature is not None else 0.7,
-        top_p=task.top_p if task.top_p is not None else 1.0,
-        top_k=task.top_k if task.top_k is not None else 0,
-    )
-
-    # Normalize stop sequences to a list
-    stop_sequences: list[str] = (
-        ([task.stop] if isinstance(task.stop, str) else task.stop)
-        if task.stop is not None
-        else []
-    )
-    max_stop_len = max((len(s) for s in stop_sequences), default=0)
-
-    max_tokens = task.max_output_tokens or MAX_TOKENS
-    accumulated_text = ""
-
+    max_tokens = task.max_tokens or MAX_TOKENS
    for out in stream_generate(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        max_tokens=max_tokens,
        sampler=sampler,
-        logits_processors=logits_processors,
        prompt_cache=caches,
-        # TODO: Dynamically change prefill step size to be the maximum possible without timing out.
-        prefill_step_size=2048,
+        prefill_step_size=65536,
        kv_group_size=KV_GROUP_SIZE,
        kv_bits=KV_BITS,
-        return_logprob=True,
-        return_top_logprobs=5,
    ):
        logger.info(out.text)
-        accumulated_text += out.text
-
-        # Check for stop sequences
-        text = out.text
-        finish_reason: FinishReason | None = cast(
-            FinishReason | None, out.finish_reason
-        )
-        stop_matched = False
-
-        if stop_sequences:
-            for stop_seq in stop_sequences:
-                if stop_seq in accumulated_text:
-                    # Trim text to just before the stop sequence
-                    stop_index = accumulated_text.find(stop_seq)
-                    text_before_stop = accumulated_text[:stop_index]
-                    chunk_start = len(accumulated_text) - len(out.text)
-                    text = text_before_stop[chunk_start:]
-                    finish_reason = "stop"
-                    stop_matched = True
-                    break
-
-        is_done = finish_reason is not None
-        stats: GenerationStats | None = None
-        if is_done:
-            stats = GenerationStats(
-                prompt_tps=float(out.prompt_tps),
-                generation_tps=float(out.generation_tps),
-                prompt_tokens=int(out.prompt_tokens),
-                generation_tokens=int(out.generation_tokens),
-                peak_memory_usage=Memory.from_gb(out.peak_memory),
+        if out.finish_reason is not None and out.finish_reason not in get_args(
+            FinishReason
+        ):
+            # We don't throw here as this failure case is really not all that bad
+            # Just log the error and move on
+            logger.warning(
+                f"Model generated unexpected finish_reason: {out.finish_reason}"
            )
-            if not stop_matched and out.finish_reason not in get_args(FinishReason):
-                logger.warning(
-                    f"Model generated unexpected finish_reason: {out.finish_reason}"
-                )
-
-        # Extract logprobs if available
-        logprob: float | None = getattr(out, "logprob", None)
-        top_logprobs_raw: list[tuple[int, float]] | None = getattr(
-            out, "top_logprobs", None
-        )
-
-        top_logprobs: list[TopLogprobItem] | None = None
-        if top_logprobs_raw is not None:
-            top_logprobs = [
-                TopLogprobItem(
-                    token=text if i == 0 else tokenizer.decode([tok_id]),
-                    logprob=float(lp),
-                )
-                for i, (tok_id, lp) in enumerate(top_logprobs_raw)
-            ]

        yield GenerationResponse(
-            text=text,
+            text=out.text,
            token=out.token,
-            logprob=logprob,
-            top_logprobs=top_logprobs,
-            finish_reason=finish_reason,
-            stats=stats,
+            finish_reason=cast(FinishReason | None, out.finish_reason),
        )

-        if is_done:
+        if out.finish_reason is not None:
            break
-
-        # Limit accumulated_text to what's needed for stop sequence detection
-        if max_stop_len > 0 and len(accumulated_text) > max_stop_len:
-            accumulated_text = accumulated_text[-max_stop_len:]
-
-        # TODO: Do we want an mx_barrier?
--- a/src/exo/worker/engines/mlx/utils_mlx.py
+++ b/src/exo/worker/engines/mlx/utils_mlx.py
@@ -1,28 +1,13 @@
 import json
 import os
 import resource
-import sys
-import threading
 import time
-from collections.abc import Callable
 from pathlib import Path
-from typing import Any, cast
-
-# Monkey-patch for transformers 5.x compatibility
-# Kimi's tokenization_kimi.py imports bytes_to_unicode from the old location
-# which was moved in transformers 5.0.0rc2
-try:
-    import transformers.models.gpt2.tokenization_gpt2 as gpt2_tokenization
-    from transformers.convert_slow_tokenizer import bytes_to_unicode
-
-    if not hasattr(gpt2_tokenization, "bytes_to_unicode"):
-        gpt2_tokenization.bytes_to_unicode = bytes_to_unicode  # type: ignore[attr-defined]
-except ImportError:
-    pass  # transformers < 5.0 or bytes_to_unicode not available
+from typing import Any, Callable, cast

 from mlx_lm.models.cache import KVCache, QuantizedKVCache, RotatingKVCache
 from mlx_lm.models.deepseek_v3 import DeepseekV3Model
-from mlx_lm.models.gpt_oss import Model as GptOssModel
+from mlx_lm.sample_utils import make_sampler
 from mlx_lm.tokenizer_utils import TokenizerWrapper

 from exo.worker.engines.mlx.constants import (
@@ -34,7 +19,7 @@ from exo.worker.engines.mlx.constants import (
 try:
    from mlx_lm.tokenizer_utils import load_tokenizer
 except ImportError:
-    from mlx_lm.tokenizer_utils import load as load_tokenizer
+    from mlx_lm.tokenizer_utils import load as load_tokenizer  # type: ignore
 import contextlib

 import mlx.core as mx
@@ -42,9 +27,10 @@ import mlx.nn as nn
 from mlx_lm.utils import load_model
 from pydantic import RootModel

+from exo.shared.types.api import ChatCompletionMessageText
 from exo.shared.types.common import Host
 from exo.shared.types.memory import Memory
-from exo.shared.types.openai_responses import ResponsesRequest
+from exo.shared.types.tasks import ChatCompletionTaskParams
 from exo.shared.types.worker.instances import (
    BoundInstance,
    MlxJacclInstance,
@@ -83,45 +69,6 @@ def get_weights_size(model_shard_meta: ShardMetadata) -> Memory:
    )


-class ModelLoadingTimeoutError(Exception):
-    pass
-
-
-TimeoutCallback = Callable[[], None]
-
-
-def eval_with_timeout(
-    mlx_item: Any,  # pyright: ignore[reportAny]
-    timeout_seconds: float = 60.0,
-    on_timeout: TimeoutCallback | None = None,
-) -> None:
-    """Evaluate MLX item with a hard timeout.
-
-    If on_timeout callback is provided, it will be called before terminating
-    the process. This allows the runner to send a failure event before exit.
-    """
-    completed = threading.Event()
-
-    def watchdog() -> None:
-        if not completed.wait(timeout=timeout_seconds):
-            logger.error(
-                f"mlx_item evaluation timed out after {timeout_seconds:.0f}s. "
-                "This may indicate an issue with FAST_SYNCH and tensor parallel sharding. "
-                "Terminating process."
-            )
-            if on_timeout is not None:
-                on_timeout()
-            os._exit(1)
-
-    watchdog_thread = threading.Thread(target=watchdog, daemon=True)
-    watchdog_thread.start()
-
-    try:
-        mx.eval(mlx_item)  # pyright: ignore[reportAny]
-    finally:
-        completed.set()
-
-
 def mx_barrier(group: Group | None = None):
    mx.eval(
        mx.distributed.all_sum(
@@ -228,10 +175,12 @@ def initialize_mlx(


 def load_mlx_items(
-    bound_instance: BoundInstance,
-    group: Group | None,
-    on_timeout: TimeoutCallback | None = None,
-) -> tuple[Model, TokenizerWrapper]:
+    bound_instance: BoundInstance, group: Group | None
+) -> tuple[Model, TokenizerWrapper, Callable[[mx.array], mx.array]]:
+    # TODO: pass temperature
+    sampler: Callable[[mx.array], mx.array] = make_sampler(temp=0.7)
+    logger.info("Created a sampler")
+
    if group is None:
        logger.info(f"Single device used for {bound_instance.instance}")
        model_path = build_model_path(bound_instance.bound_shard.model_meta.model_id)
@@ -244,9 +193,7 @@ def load_mlx_items(
    else:
        logger.info("Starting distributed init")
        start_time = time.perf_counter()
-        model, tokenizer = shard_and_load(
-            bound_instance.bound_shard, group=group, on_timeout=on_timeout
-        )
+        model, tokenizer = shard_and_load(bound_instance.bound_shard, group=group)
        end_time = time.perf_counter()
        logger.info(
            f"Time taken to shard and load model: {(end_time - start_time):.2f}s"
@@ -254,13 +201,12 @@ def load_mlx_items(

    set_wired_limit_for_model(get_weights_size(bound_instance.bound_shard))

-    return cast(Model, model), tokenizer
+    return cast(Model, model), tokenizer, sampler


 def shard_and_load(
    shard_metadata: ShardMetadata,
    group: Group,
-    on_timeout: TimeoutCallback | None = None,
 ) -> tuple[nn.Module, TokenizerWrapper]:
    model_path = build_model_path(shard_metadata.model_meta.model_id)

@@ -297,15 +243,7 @@ def shard_and_load(
            logger.info(f"loading model from {model_path} with pipeline parallelism")
            model = pipeline_auto_parallel(model, group, shard_metadata)

-    # Estimate timeout based on model size
-    base_timeout = float(os.environ.get("EXO_MODEL_LOAD_TIMEOUT", "60"))
-    model_size_gb = get_weights_size(shard_metadata).in_bytes / (1024**3)
-    timeout_seconds = base_timeout + model_size_gb / 5
-    logger.info(
-        f"Evaluating model parameters with timeout of {timeout_seconds:.0f}s "
-        f"(model size: {model_size_gb:.1f}GB)"
-    )
-    eval_with_timeout(model.parameters(), timeout_seconds, on_timeout)
+    mx.eval(model.parameters())

    # TODO: Do we need this?
    mx.eval(model)
@@ -319,127 +257,62 @@ def shard_and_load(
    return model, tokenizer


-def get_tokenizer(model_path: Path, shard_metadata: ShardMetadata) -> TokenizerWrapper:
-    """Load tokenizer for a model shard. Delegates to load_tokenizer_for_model_id."""
-    return load_tokenizer_for_model_id(shard_metadata.model_meta.model_id, model_path)
+def get_tokenizer(model_path: Path, shard_metadata: ShardMetadata):
+    # TODO: Let's move away from this custom logic to mlx_lm.load()
+    if "kimi-k2" in shard_metadata.model_meta.model_id.lower():
+        eos_token_ids = [163586]

+    elif "glm" in shard_metadata.model_meta.model_id.lower():
+        eos_token_ids = [151336, 151329, 151338]

-def get_eos_token_ids_for_model(model_id: str) -> list[int] | None:
-    """
-    Get the EOS token IDs for a model based on its ID.
+    else:
+        eos_token_ids = None

-    Some models require explicit EOS token configuration that isn't in their
-    tokenizer config. This function returns the known EOS token IDs for such models.
-
-    Args:
-        model_id: The HuggingFace model ID
-
-    Returns:
-        List of EOS token IDs, or None if the model uses standard tokenizer config
-    """
-    model_id_lower = model_id.lower()
-    if "kimi-k2" in model_id_lower:
-        return [163586]
-    elif "glm" in model_id_lower:
-        return [151336, 151329, 151338]
-    return None
-
-
-def load_tokenizer_for_model_id(model_id: str, model_path: Path) -> TokenizerWrapper:
-    """
-    Load tokenizer for a model given its ID and local path.
-
-    This is the core tokenizer loading logic, handling special cases for different
-    model families (Kimi, GLM, etc.) and transformers 5.x compatibility.
-
-    Args:
-        model_id: The HuggingFace model ID (e.g., "moonshotai/Kimi-K2-Instruct")
-        model_path: Local path where the model/tokenizer files are stored
-
-    Returns:
-        TokenizerWrapper instance configured for the model
-    """
-    model_id_lower = model_id.lower()
-    eos_token_ids = get_eos_token_ids_for_model(model_id)
-
-    # Kimi uses a custom TikTokenTokenizer that transformers 5.x can't load via AutoTokenizer
-    if "kimi-k2" in model_id_lower:
-        sys.path.insert(0, str(model_path))
-        from tokenization_kimi import TikTokenTokenizer  # type: ignore[import-not-found]  # noqa: I001
-
-        hf_tokenizer: Any = TikTokenTokenizer.from_pretrained(model_path)  # pyright: ignore[reportUnknownVariableType,reportUnknownMemberType]
-
-        # Patch encode to use internal tiktoken model directly
-        # transformers 5.x has a bug in the encode->pad path for slow tokenizers
-        def _patched_encode(text: str, **_kwargs: object) -> list[int]:
-            # Pass allowed_special="all" to handle special tokens like <|im_user|>
-            return list(hf_tokenizer.model.encode(text, allowed_special="all"))  # pyright: ignore[reportUnknownMemberType,reportUnknownArgumentType]
-
-        hf_tokenizer.encode = _patched_encode
-        return TokenizerWrapper(hf_tokenizer, eos_token_ids=eos_token_ids)
-
-    tokenizer = load_tokenizer(
-        model_path,
-        tokenizer_config_extra={"trust_remote_code": TRUST_REMOTE_CODE},
-        eos_token_ids=eos_token_ids,
+    tokenizer = cast(
+        TokenizerWrapper,
+        load_tokenizer(
+            model_path,
+            tokenizer_config_extra={"trust_remote_code": TRUST_REMOTE_CODE},
+            eos_token_ids=eos_token_ids,
+        ),
    )
+    assert isinstance(tokenizer, TokenizerWrapper)

    return tokenizer


 def apply_chat_template(
    tokenizer: TokenizerWrapper,
-    task_params: ResponsesRequest,
+    chat_task_data: ChatCompletionTaskParams,
 ) -> str:
-    """Convert ResponsesRequest to a chat template prompt.
+    # Now we can properly access the messages
+    messages = chat_task_data.messages

-    Converts the internal format (input + instructions) to a messages list
-    that can be processed by the tokenizer's chat template.
-    """
    formatted_messages: list[dict[str, Any]] = []
-
-    # Add system message (instructions) if present
-    if task_params.instructions:
-        formatted_messages.append(
-            {"role": "system", "content": task_params.instructions}
-        )
-
-    # Convert input to messages
-    if isinstance(task_params.input, str):
-        # Simple string input becomes a single user message
-        formatted_messages.append({"role": "user", "content": task_params.input})
-    else:
-        # List of InputMessage
-        for msg in task_params.input:
-            if not msg.content:
-                logger.warning("Received message with empty content, skipping")
+    for _, message in enumerate(messages):
+        if isinstance(message.content, ChatCompletionMessageText):
+            message.content = message.content.text
+        if isinstance(message.content, list):
+            if len(message.content) != 1:
+                logger.warning("Received malformed prompt")
                continue
-            formatted_messages.append({"role": msg.role, "content": msg.content})

-    # Use continue_final_message when continuing from prefix (e.g., regenerate from token)
-    # This keeps the final assistant message open without EOS tokens
-    # Note: explicitly set add_generation_prompt=False when using continue_final_message
-    # because some tokenizers (e.g., Kimi) default add_generation_prompt=True
-    prompt: str
-    if task_params.continue_from_prefix:
-        prompt = tokenizer.apply_chat_template(
-            formatted_messages,
-            tokenize=False,
-            continue_final_message=True,
-            add_generation_prompt=False,
-            tools=task_params.tools,
-        )
-    else:
-        prompt = tokenizer.apply_chat_template(
-            formatted_messages,
-            tokenize=False,
-            add_generation_prompt=True,
-            tools=task_params.tools,
+            message.content = message.content[0].text
+        if message.content is None and message.thinking is None:
+            continue
+
+        # Null values are not valid when applying templates in tokenizer
+        formatted_messages.append(
+            {k: v for k, v in message.model_dump().items() if v is not None}  # type: ignore
        )

-    logger.info(prompt)
+    prompt: str = tokenizer.apply_chat_template(  # type: ignore
+        formatted_messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )

-    return prompt
+    return prompt  # type: ignore


 class NullKVCache(KVCache):
@@ -470,11 +343,6 @@ def make_kv_cache(
 ) -> list[KVCache | RotatingKVCache | QuantizedKVCache]:
    assert hasattr(model, "layers")

-    # TODO: Do this for all models
-    if hasattr(model, "make_cache") and isinstance(model, GptOssModel):
-        logger.info("Using MLX LM's make cache")
-        return model.make_cache()  # type: ignore
-
    if max_kv_size is None:
        if KV_CACHE_BITS is None:
            logger.info("Using default KV cache")
@@ -529,13 +397,3 @@ def set_wired_limit_for_model(model_size: Memory):
        )
    mx.set_wired_limit(max_rec_size)
    logger.info(f"Wired limit set to {max_rec_size}.")
-
-
-def mlx_cleanup(
-    model: Model | None, tokenizer: TokenizerWrapper | None, group: Group | None
-) -> None:
-    del model, tokenizer, group
-    mx.clear_cache()
-    import gc
-
-    gc.collect()
--- a/src/exo/worker/main.py
+++ b/src/exo/worker/main.py
@@ -217,9 +217,7 @@ class Worker:
                    )
                    if initial_progress.status == "complete":
                        progress = DownloadCompleted(
-                            shard_metadata=shard,
-                            node_id=self.node_id,
-                            total_bytes=initial_progress.total_bytes,
+                            shard_metadata=shard, node_id=self.node_id
                        )
                        self.download_status[shard.model_meta.model_id] = progress
                        await self.event_sender.send(
@@ -366,11 +364,7 @@ class Worker:
            nonlocal self
            nonlocal last_progress_time
            if progress.status == "complete":
-                status = DownloadCompleted(
-                    shard_metadata=shard,
-                    node_id=self.node_id,
-                    total_bytes=progress.total_bytes,
-                )
+                status = DownloadCompleted(shard_metadata=shard, node_id=self.node_id)
                self.download_status[shard.model_meta.model_id] = status
                # Footgun!
                self.event_sender.send_nowait(
@@ -463,9 +457,7 @@ class Worker:
                ) in self.shard_downloader.get_shard_download_status():
                    if progress.status == "complete":
                        status = DownloadCompleted(
-                            node_id=self.node_id,
-                            shard_metadata=progress.shard,
-                            total_bytes=progress.total_bytes,
+                            node_id=self.node_id, shard_metadata=progress.shard
                        )
                    elif progress.status in ["in_progress", "not_started"]:
                        if progress.downloaded_bytes_this_session.in_bytes == 0:
--- a/src/exo/worker/plan.py
+++ b/src/exo/worker/plan.py
@@ -274,12 +274,6 @@ def _pending_tasks(
            if task.instance_id != runner.bound_instance.instance.instance_id:
                continue

-            # I have a design point here; this is a state race in disguise as the task status doesn't get updated to completed fast enough
-            # however, realistically the task status should be set to completed by the LAST runner, so this is a true race
-            # the actual solution is somewhat deeper than this bypass - TODO!
-            if task.task_id in runner.completed:
-                continue
-
            # TODO: Check ordering aligns with MLX distributeds expectations.

            if isinstance(runner.status, RunnerReady) and all(
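The lines removed from `_pending_tasks` skipped any task whose ID the runner had already recorded in `runner.completed`. The deleted comment describes why: the shared task status is not updated to completed quickly enough, so a finished task can still look pending. The sketch below shows that filter in isolation. `Task`, `Runner`, and `pending_tasks_for` are hypothetical stand-ins, not exo's real types.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    instance_id: str


@dataclass
class Runner:
    instance_id: str
    # IDs this runner has already reported complete (hypothetical stand-in).
    completed: set[str] = field(default_factory=set)


def pending_tasks_for(runner: Runner, tasks: list[Task]) -> list[Task]:
    """Tasks bound to this runner's instance that it has not already finished."""
    out: list[Task] = []
    for task in tasks:
        # Only consider tasks for the instance this runner is bound to.
        if task.instance_id != runner.instance_id:
            continue
        # The removed bypass: shared state may still show the task as pending
        # after the runner finished it, so skip it instead of re-dispatching.
        if task.task_id in runner.completed:
            continue
        out.append(task)
    return out


r = Runner("i1", completed={"t2"})
ts = [Task("t1", "i1"), Task("t2", "i1"), Task("t3", "i2")]
print([t.task_id for t in pending_tasks_for(r, ts)])  # ['t1']
```

As the deleted comment notes, this only papers over the race; the status should really be set to completed by the last runner.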
--- a/src/exo/worker/runner/bootstrap.py
+++ b/src/exo/worker/runner/bootstrap.py
@@ -6,7 +6,7 @@ from exo.shared.types.events import Event, RunnerStatusUpdated
 from exo.shared.types.tasks import Task
 from exo.shared.types.worker.instances import BoundInstance, MlxJacclInstance
 from exo.shared.types.worker.runners import RunnerFailed
-from exo.utils.channels import ClosedResourceError, MpReceiver, MpSender
+from exo.utils.channels import MpReceiver, MpSender

 logger: "loguru.Logger" = loguru.logger

@@ -17,30 +17,20 @@ def entrypoint(
    task_receiver: MpReceiver[Task],
    _logger: "loguru.Logger",
 ) -> None:
-    fast_synch_override = os.environ.get("EXO_FAST_SYNCH")
-    if fast_synch_override == "on" or (
-        fast_synch_override != "off"
-        and (
-            isinstance(bound_instance.instance, MlxJacclInstance)
-            and len(bound_instance.instance.ibv_devices) >= 2
-        )
+    if (
+        isinstance(bound_instance.instance, MlxJacclInstance)
+        and len(bound_instance.instance.ibv_devices) >= 2
    ):
        os.environ["MLX_METAL_FAST_SYNCH"] = "1"
-    else:
-        os.environ["MLX_METAL_FAST_SYNCH"] = "0"

    global logger
    logger = _logger

-    logger.info(f"Fast synch flag: {os.environ['MLX_METAL_FAST_SYNCH']}")
-
    # Import main after setting global logger - this lets us just import logger from this module
    try:
        from exo.worker.runner.runner import main

        main(bound_instance, event_sender, task_receiver)
-    except ClosedResourceError:
-        logger.warning("Runner communication closed unexpectedly")
    except Exception as e:
        logger.opt(exception=e).warning(
            f"Runner {bound_instance.bound_runner_id} crashed with critical exception {e}"
@@ -52,10 +42,8 @@ def entrypoint(
            )
        )
    finally:
-        try:
-            event_sender.close()
-            task_receiver.close()
-        finally:
-            event_sender.join()
-            task_receiver.join()
-            logger.info("bye from the runner")
+        event_sender.close()
+        task_receiver.close()
+        event_sender.join()
+        task_receiver.join()
+        logger.info("bye from the runner")
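Before this change, `entrypoint` resolved `MLX_METAL_FAST_SYNCH` with an `EXO_FAST_SYNCH` override taking precedence over the topology check. The sketch below restates that removed logic as a pure function so the precedence is explicit. `resolve_fast_synch` and its parameters are hypothetical names, and `is_jaccl` stands in for the `isinstance(bound_instance.instance, MlxJacclInstance)` check.

```python
from collections.abc import Mapping


def resolve_fast_synch(
    env: Mapping[str, str], is_jaccl: bool, ibv_device_count: int
) -> str:
    """Value the removed code wrote to MLX_METAL_FAST_SYNCH ("1" or "0")."""
    override = env.get("EXO_FAST_SYNCH")
    if override == "on":
        return "1"  # explicit opt-in wins regardless of topology
    if override == "off":
        return "0"  # explicit opt-out wins regardless of topology
    # Unset or unrecognised override: enable only for JACCL with >= 2 IBV devices.
    return "1" if is_jaccl and ibv_device_count >= 2 else "0"


print(resolve_fast_synch({}, is_jaccl=True, ibv_device_count=2))  # 1
print(resolve_fast_synch({"EXO_FAST_SYNCH": "off"}, True, 2))  # 0
```

Calling code would assign the result to `os.environ["MLX_METAL_FAST_SYNCH"]`. The replacement code only ever writes `"1"`, so when the JACCL condition is false, any value already present in the inherited environment is left as-is.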
--- a/src/exo/worker/runner/runner.py
+++ b/src/exo/worker/runner/runner.py
@@ -1,20 +1,7 @@
 import time
-from collections.abc import Generator
-from contextlib import contextmanager
-from functools import cache
-from typing import cast
-
-import mlx.core as mx
-from mlx_lm.models.gpt_oss import Model as GptOssModel
-from openai_harmony import (  # pyright: ignore[reportMissingTypeStubs]
-    HarmonyEncodingName,
-    Role,
-    StreamableParser,
-    load_harmony_encoding,
-)

+from exo.shared.types.api import ChatCompletionMessageText
 from exo.shared.types.chunks import TokenChunk
-from exo.shared.types.common import CommandId
 from exo.shared.types.events import (
    ChunkGenerated,
    Event,
@@ -22,8 +9,6 @@ from exo.shared.types.events import (
    TaskAcknowledged,
    TaskStatusUpdated,
 )
-from exo.shared.types.models import ModelId
-from exo.shared.types.openai_responses import ResponsesRequest
 from exo.shared.types.tasks import (
    ChatCompletion,
    ConnectToGroup,
@@ -47,12 +32,10 @@ from exo.shared.types.worker.runners import (
    RunnerReady,
    RunnerRunning,
    RunnerShutdown,
-    RunnerShuttingDown,
    RunnerStatus,
    RunnerWarmingUp,
 )
-from exo.utils.channels import MpReceiver, MpSender
-from exo.worker.engines.mlx import Model
+from exo.utils.channels import ClosedResourceError, MpReceiver, MpSender
 from exo.worker.engines.mlx.generator.generate import mlx_generate, warmup_inference
 from exo.worker.engines.mlx.utils_mlx import (
    initialize_mlx,
@@ -62,33 +45,6 @@ from exo.worker.engines.mlx.utils_mlx import (
 from exo.worker.runner.bootstrap import logger


-@contextmanager
-def send_error_chunk_on_exception(
-    event_sender: MpSender[Event],
-    command_id: CommandId,
-    model_id: ModelId,
-    device_rank: int,
-):
-    try:
-        yield
-    except Exception as e:
-        logger.error(e)
-        if device_rank == 0:
-            event_sender.send(
-                ChunkGenerated(
-                    command_id=command_id,
-                    chunk=TokenChunk(
-                        idx=0,
-                        model=model_id,
-                        text="",
-                        token_id=0,
-                        finish_reason="error",
-                        error_message=str(e),
-                    ),
-                )
-            )
-
-
 def main(
    bound_instance: BoundInstance,
    event_sender: MpSender[Event],
@@ -99,132 +55,117 @@ def main(
        bound_instance.bound_runner_id,
        bound_instance.bound_shard,
    )
-    logger.info("hello from the runner")
-    if getattr(shard_metadata, "immediate_exception", False):
-        raise Exception("Fake exception - runner failed to spin up.")
-    if timeout := getattr(shard_metadata, "should_timeout", 0):
-        time.sleep(timeout)
+    try:
+        logger.info("hello from the runner")
+        if getattr(shard_metadata, "immediate_exception", False):
+            raise Exception("Fake exception - runner failed to spin up.")
+        if timeout := getattr(shard_metadata, "should_timeout", 0):
+            time.sleep(timeout)

-    setup_start_time = time.time()
+        setup_start_time = time.time()

-    model = None
-    tokenizer = None
-    group = None
+        model = None
+        tokenizer = None
+        sampler = None
+        group = None

-    current_status: RunnerStatus = RunnerIdle()
-    logger.info("runner created")
-    event_sender.send(
-        RunnerStatusUpdated(runner_id=runner_id, runner_status=current_status)
-    )
-    with task_receiver as tasks:
-        for task in tasks:
-            event_sender.send(
-                TaskStatusUpdated(task_id=task.task_id, task_status=TaskStatus.Running)
-            )
-            event_sender.send(TaskAcknowledged(task_id=task.task_id))
-            match task:
-                case ConnectToGroup() if isinstance(
-                    current_status, (RunnerIdle, RunnerFailed)
-                ):
-                    logger.info("runner connecting")
-                    current_status = RunnerConnecting()
-                    event_sender.send(
-                        RunnerStatusUpdated(
-                            runner_id=runner_id, runner_status=current_status
-                        )
+        current_status: RunnerStatus = RunnerIdle()
+        logger.info("runner created")
+        event_sender.send(
+            RunnerStatusUpdated(runner_id=runner_id, runner_status=current_status)
+        )
+        with task_receiver as tasks:
+            for task in tasks:
+                event_sender.send(
+                    TaskStatusUpdated(
+                        task_id=task.task_id, task_status=TaskStatus.Running
                    )
-                    group = initialize_mlx(bound_instance)
-
-                    logger.info("runner connected")
-                    current_status = RunnerConnected()
-
-                # we load the model if it's connected with a group, or idle without a group. we should never tell a model to connect if it doesn't need to
-                case LoadModel() if (
-                    isinstance(current_status, RunnerConnected) and group is not None
-                ) or (isinstance(current_status, RunnerIdle) and group is None):
-                    current_status = RunnerLoading()
-                    logger.info("runner loading")
-                    event_sender.send(
-                        RunnerStatusUpdated(
-                            runner_id=runner_id, runner_status=current_status
-                        )
-                    )
-
-                    def on_model_load_timeout() -> None:
+                )
+                event_sender.send(TaskAcknowledged(task_id=task.task_id))
+                match task:
+                    case ConnectToGroup() if isinstance(
+                        current_status, (RunnerIdle, RunnerFailed)
+                    ):
+                        logger.info("runner connecting")
+                        current_status = RunnerConnecting()
                        event_sender.send(
                            RunnerStatusUpdated(
-                                runner_id=runner_id,
-                                runner_status=RunnerFailed(
-                                    error_message="Model loading timed out"
-                                ),
+                                runner_id=runner_id, runner_status=current_status
                            )
                        )
-                        time.sleep(0.5)
+                        group = initialize_mlx(bound_instance)

-                    model, tokenizer = load_mlx_items(
-                        bound_instance, group, on_timeout=on_model_load_timeout
-                    )
+                        logger.info("runner connected")
+                        current_status = RunnerConnected()

-                    current_status = RunnerLoaded()
-                    logger.info("runner loaded")
-                case StartWarmup() if isinstance(current_status, RunnerLoaded):
-                    assert model
-                    assert tokenizer
-                    current_status = RunnerWarmingUp()
-                    logger.info("runner warming up")
-                    event_sender.send(
-                        RunnerStatusUpdated(
-                            runner_id=runner_id, runner_status=current_status
+                    # we load the model if it's connected with a group, or idle without a group. we should never tell a model to connect if it doesn't need to
+                    case LoadModel() if (
+                        isinstance(current_status, RunnerConnected)
+                        and group is not None
+                    ) or (isinstance(current_status, RunnerIdle) and group is None):
+                        current_status = RunnerLoading()
+                        logger.info("runner loading")
+                        event_sender.send(
+                            RunnerStatusUpdated(
+                                runner_id=runner_id, runner_status=current_status
+                            )
                        )
-                    )

-                    logger.info(f"warming up inference for instance: {instance}")
-                    toks = warmup_inference(
-                        model=cast(Model, model),
-                        tokenizer=tokenizer,
-                        # kv_prefix_cache=kv_prefix_cache,  # supply for warmup-time prefix caching
-                    )
-                    logger.info(f"warmed up by generating {toks} tokens")
-                    logger.info(
-                        f"runner initialized in {time.time() - setup_start_time} seconds"
-                    )
-                    current_status = RunnerReady()
-                    logger.info("runner ready")
-                case ChatCompletion(task_params=task_params, command_id=command_id) if (
-                    isinstance(current_status, RunnerReady)
-                ):
-                    logger.info(f"received chat request: {str(task)[:500]}")
-                    current_status = RunnerRunning()
-                    logger.info("runner running")
-                    event_sender.send(
-                        RunnerStatusUpdated(
-                            runner_id=runner_id, runner_status=current_status
+                        model, tokenizer, sampler = load_mlx_items(
+                            bound_instance, group
                        )
-                    )
-                    with send_error_chunk_on_exception(
-                        event_sender,
-                        command_id,
-                        shard_metadata.model_meta.model_id,
-                        shard_metadata.device_rank,
-                    ):
+
+                        current_status = RunnerLoaded()
+                        logger.info("runner loaded")
+                    case StartWarmup() if isinstance(current_status, RunnerLoaded):
                        assert model
                        assert tokenizer
-                        _check_for_debug_prompts(task_params)
-
-                        # Generate responses using the actual MLX generation
-                        mlx_generator = mlx_generate(
-                            model=cast(Model, model),
-                            tokenizer=tokenizer,
-                            task=task_params,
+                        assert sampler
+                        current_status = RunnerWarmingUp()
+                        logger.info("runner warming up")
+                        event_sender.send(
+                            RunnerStatusUpdated(
+                                runner_id=runner_id, runner_status=current_status
+                            )
                        )

-                        # GPT-OSS specific parsing to match other model formats.
-                        if isinstance(model, GptOssModel):
-                            mlx_generator = parse_gpt_oss(mlx_generator)
+                        logger.info(f"warming up inference for instance: {instance}")
+                        toks = warmup_inference(
+                            model=model,
+                            tokenizer=tokenizer,
+                            sampler=sampler,
+                            # kv_prefix_cache=kv_prefix_cache,  # supply for warmup-time prefix caching
+                        )
+                        logger.info(f"warmed up by generating {toks} tokens")
+                        logger.info(
+                            f"runner initialized in {time.time() - setup_start_time} seconds"
+                        )
+                        current_status = RunnerReady()
+                        logger.info("runner ready")
+                    case ChatCompletion(
+                        task_params=task_params, command_id=command_id
+                    ) if isinstance(current_status, RunnerReady):
+                        assert model
+                        assert tokenizer
+                        assert sampler
+                        logger.info(f"received chat request: {str(task)[:500]}")
+                        current_status = RunnerRunning()
+                        logger.info("runner running")
+                        event_sender.send(
+                            RunnerStatusUpdated(
+                                runner_id=runner_id, runner_status=current_status
+                            )
+                        )
+                        assert task_params.messages[0].content is not None
+                        _check_for_debug_prompts(task_params.messages[0].content)

-                        # TODO: Add tool call parser here
-
-                        for response in mlx_generator:
+                        # Generate responses using the actual MLX generation
+                        for response in mlx_generate(
+                            model=model,
+                            tokenizer=tokenizer,
+                            sampler=sampler,
+                            task=task_params,
+                        ):
                            match response:
                                case GenerationResponse():
                                    if shard_metadata.device_rank == 0:
@@ -236,79 +177,58 @@ def main(
                                                    model=shard_metadata.model_meta.model_id,
                                                    text=response.text,
                                                    token_id=response.token,
-                                                    logprob=response.logprob,
-                                                    top_logprobs=response.top_logprobs,
                                                    finish_reason=response.finish_reason,
-                                                    stats=response.stats,
                                                ),
                                            )
                                        )
+                                    # case TokenizedResponse():
+                                    # TODO: something here ig

-                    current_status = RunnerReady()
-                    logger.info("runner ready")
-                case Shutdown():
-                    current_status = RunnerShuttingDown()
-                    logger.info("runner shutting down")
-                    event_sender.send(
-                        RunnerStatusUpdated(
-                            runner_id=runner_id, runner_status=current_status
+                        current_status = RunnerReady()
+                        logger.info("runner ready")
+                    case Shutdown():
+                        logger.info("runner shutting down")
+                        event_sender.send(
+                            TaskStatusUpdated(
+                                task_id=task.task_id, task_status=TaskStatus.Complete
+                            )
                        )
+                        break
+                    case _:
+                        raise ValueError(
+                            f"Received {task.__class__.__name__} outside of state machine in {current_status=}"
+                        )
+                event_sender.send(
+                    TaskStatusUpdated(
+                        task_id=task.task_id, task_status=TaskStatus.Complete
                    )
-                    current_status = RunnerShutdown()
-                case _:
-                    raise ValueError(
-                        f"Received {task.__class__.__name__} outside of state machine in {current_status=}"
+                )
+                event_sender.send(
+                    RunnerStatusUpdated(
+                        runner_id=runner_id, runner_status=current_status
                    )
-            event_sender.send(
-                TaskStatusUpdated(task_id=task.task_id, task_status=TaskStatus.Complete)
+                )
+        event_sender.send(
+            RunnerStatusUpdated(runner_id=runner_id, runner_status=RunnerShutdown())
+        )
+    except ClosedResourceError:
+        logger.warning("runner communication closed unexpectedly")
+    except Exception as e:
+        logger.opt(exception=e).warning(
+            f"Runner {runner_id} crashed with critical exception {e}"
+        )
+        event_sender.send(
+            RunnerStatusUpdated(
+                runner_id=runner_id,
+                runner_status=RunnerFailed(error_message=str(e)),
            )
-            event_sender.send(
-                RunnerStatusUpdated(runner_id=runner_id, runner_status=current_status)
-            )
-            if isinstance(current_status, RunnerShutdown):
-                del model, tokenizer, group
-                mx.clear_cache()
-                import gc
-
-                gc.collect()
-                break
-
-
-@cache
-def get_gpt_oss_encoding():
-    encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
-    return encoding
-
-
-def parse_gpt_oss(
-    responses: Generator[GenerationResponse],
-) -> Generator[GenerationResponse]:
-    encoding = get_gpt_oss_encoding()
-    stream = StreamableParser(encoding, role=Role.ASSISTANT)
-    thinking = False
-
-    for response in responses:
-        stream.process(response.token)
-
-        delta = stream.last_content_delta
-        ch = stream.current_channel
-
-        if ch == "analysis" and not thinking:
-            thinking = True
-            yield response.model_copy(update={"text": "<think>"})
-
-        if ch != "analysis" and thinking:
-            thinking = False
-            yield response.model_copy(update={"text": "</think>"})
-
-        if delta:
-            yield response.model_copy(update={"text": delta})
-
-        if response.finish_reason is not None:
-            if thinking:
-                yield response.model_copy(update={"text": "</think>"})
-            yield response
-            break
+        )
+    finally:
+        event_sender.close()
+        task_receiver.close()
+        event_sender.join()
+        task_receiver.join()
+        logger.info("bye from the runner")


 EXO_RUNNER_MUST_FAIL = "EXO RUNNER MUST FAIL"
@@ -316,23 +236,17 @@ EXO_RUNNER_MUST_OOM = "EXO RUNNER MUST OOM"
 EXO_RUNNER_MUST_TIMEOUT = "EXO RUNNER MUST TIMEOUT"


-def _check_for_debug_prompts(task_params: ResponsesRequest) -> None:
-    """Check for debug prompt triggers in the input.
-
-    Extracts the first user input text and checks for debug triggers.
-    """
-    prompt: str
-    if isinstance(task_params.input, str):
-        prompt = task_params.input
-    else:
-        # List of InputMessage - get first message content
-        if len(task_params.input) == 0:
-            logger.debug("Empty message list in debug prompt check")
+def _check_for_debug_prompts(
+    prompt: str | ChatCompletionMessageText | list[ChatCompletionMessageText],
+):
+    if isinstance(prompt, list):
+        if len(prompt) == 0:
+            logger.debug("Empty message prompt received in debug prompt")
            return
-        prompt = task_params.input[0].content
+        prompt = prompt[0]

-    if not prompt:
-        return
+    if isinstance(prompt, ChatCompletionMessageText):
+        prompt = prompt.text

    if EXO_RUNNER_MUST_FAIL in prompt:
        logger.info("raising exception")
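The reindented `match` in `main` is a guarded state machine. Each task type is accepted only from particular runner states, and anything else falls through to the `ValueError` arm. The sketch below encodes the guards from the `+` side of the hunk. `Status` and `next_status` are simplified, hypothetical stand-ins for the `Runner*` status classes. Only the state the runner settles in after each task is modelled; the transient `Connecting`, `Loading`, `WarmingUp`, and `Running` statuses appear as comments.

```python
from enum import Enum, auto


class Status(Enum):
    IDLE = auto()
    FAILED = auto()
    CONNECTED = auto()
    LOADED = auto()
    READY = auto()
    SHUTDOWN = auto()


def next_status(task: str, status: Status, has_group: bool) -> Status:
    """Status a runner settles in after handling `task`, mirroring the guards in main()."""
    if task == "ConnectToGroup" and status in (Status.IDLE, Status.FAILED):
        return Status.CONNECTED  # reports Connecting while initialize_mlx runs
    if task == "LoadModel" and (
        (status is Status.CONNECTED and has_group)
        or (status is Status.IDLE and not has_group)
    ):
        return Status.LOADED  # reports Loading while the model loads
    if task == "StartWarmup" and status is Status.LOADED:
        return Status.READY  # reports WarmingUp during warmup_inference
    if task == "ChatCompletion" and status is Status.READY:
        return Status.READY  # reports Running while tokens stream
    if task == "Shutdown":
        return Status.SHUTDOWN  # the new code breaks out of the task loop here
    # Mirrors the `case _` arm: a task arrived in a state that cannot accept it.
    raise ValueError(f"Received {task} outside of state machine in {status=}")


# Single-node path: no group, so the runner goes straight from Idle to loading.
s = Status.IDLE
for t in ("LoadModel", "StartWarmup", "ChatCompletion", "Shutdown"):
    s = next_status(t, s, has_group=False)
print(s)  # Status.SHUTDOWN
```

Keeping the guards in one place like this makes it easy to check, for example, that a runner that has joined a group can never be told to load before it is connected.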
--- a/src/exo/worker/runner/runner_supervisor.py
+++ b/src/exo/worker/runner/runner_supervisor.py
@@ -14,23 +14,13 @@ from anyio import (
 from anyio.abc import TaskGroup
 from loguru import logger

-from exo.shared.types.events import (
-    Event,
-    RunnerStatusUpdated,
-    TaskAcknowledged,
-    TaskStatusUpdated,
-)
-from exo.shared.types.tasks import Task, TaskId, TaskStatus
+from exo.shared.types.events import Event, RunnerStatusUpdated, TaskAcknowledged
+from exo.shared.types.tasks import Task, TaskId
 from exo.shared.types.worker.instances import BoundInstance
 from exo.shared.types.worker.runners import (
-    RunnerConnecting,
    RunnerFailed,
    RunnerIdle,
-    RunnerLoading,
-    RunnerRunning,
-    RunnerShuttingDown,
    RunnerStatus,
-    RunnerWarmingUp,
 )
 from exo.shared.types.worker.shards import ShardMetadata
 from exo.utils.channels import MpReceiver, MpSender, Sender, mp_channel
@@ -49,10 +39,10 @@ class RunnerSupervisor:
    _ev_recv: MpReceiver[Event]
    _task_sender: MpSender[Task]
    _event_sender: Sender[Event]
+    # err_path: str
    _tg: TaskGroup | None = field(default=None, init=False)
    status: RunnerStatus = field(default_factory=RunnerIdle, init=False)
    pending: dict[TaskId, anyio.Event] = field(default_factory=dict, init=False)
-    completed: set[TaskId] = field(default_factory=set, init=False)

    @classmethod
    def create(
@@ -87,6 +77,7 @@ class RunnerSupervisor:
            _ev_recv=ev_recv,
            _task_sender=task_sender,
            _event_sender=event_sender,
+            # err_path=err_path,
        )

        return self
@@ -127,10 +118,6 @@ class RunnerSupervisor:
        self._tg.cancel_scope.cancel()

    async def start_task(self, task: Task):
-        if task.task_id in self.completed:
-            logger.info(
-                f"Skipping invalid task {task} as it has already been completed"
-            )
        logger.info(f"Starting task {task}")
        event = anyio.Event()
        self.pending[task.task_id] = event
@@ -151,22 +138,6 @@ class RunnerSupervisor:
                    if isinstance(event, TaskAcknowledged):
                        self.pending.pop(event.task_id).set()
                        continue
-                    if (
-                        isinstance(event, TaskStatusUpdated)
-                        and event.task_status == TaskStatus.Complete
-                    ):
-                        # If a task has just been completed, we should be working on it.
-                        assert isinstance(
-                            self.status,
-                            (
-                                RunnerRunning,
-                                RunnerWarmingUp,
-                                RunnerLoading,
-                                RunnerConnecting,
-                                RunnerShuttingDown,
-                            ),
-                        )
-                        self.completed.add(event.task_id)
                    await self._event_sender.send(event)
            except (ClosedResourceError, BrokenResourceError) as e:
                await self._check_runner(e)
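`RunnerSupervisor.start_task` registers an event in `pending` keyed by task ID, and the forwarding loop pops and sets that event when the runner sends `TaskAcknowledged`. The sketch below reproduces the handshake with `asyncio` swapped in for `anyio` and in-process queues standing in for the multiprocessing channels. It assumes `start_task` waits on the event after sending the task, since that part of the method is elided in this hunk. `Ack`, `Supervisor`, and `demo` are hypothetical.

```python
import asyncio
import contextlib
from dataclasses import dataclass, field


@dataclass
class Ack:
    task_id: str


@dataclass
class Supervisor:
    to_runner: asyncio.Queue
    from_runner: asyncio.Queue
    pending: dict[str, asyncio.Event] = field(default_factory=dict)

    async def start_task(self, task_id: str) -> None:
        # Register the event before sending so a fast ack cannot be missed.
        event = asyncio.Event()
        self.pending[task_id] = event
        await self.to_runner.put(task_id)
        await event.wait()  # assumption: block until the runner acknowledges

    async def forward_events(self, out: list) -> None:
        while True:
            ev = await self.from_runner.get()
            if isinstance(ev, Ack):
                # Wake whoever is blocked in start_task for this task.
                self.pending.pop(ev.task_id).set()
                continue
            out.append(ev)  # everything else is forwarded upstream


async def demo() -> list:
    sup = Supervisor(asyncio.Queue(), asyncio.Queue())
    forwarded: list = []

    async def fake_runner() -> None:
        task_id = await sup.to_runner.get()
        await sup.from_runner.put(Ack(task_id))  # acknowledge first...
        await sup.from_runner.put(f"done:{task_id}")  # ...then report progress

    forwarder = asyncio.create_task(sup.forward_events(forwarded))
    runner = asyncio.create_task(fake_runner())
    await sup.start_task("t1")  # returns once the ack has been seen
    await runner
    await asyncio.sleep(0)  # give the forwarder a turn to drain the queue
    forwarder.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await forwarder
    return forwarded


print(asyncio.run(demo()))  # ['done:t1']
```

Because `pending.pop` raises `KeyError` for an unknown ID, a duplicate or stray acknowledgement would surface loudly rather than being silently ignored.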
--- a/src/exo/worker/tests/unittests/conftest.py
+++ b/src/exo/worker/tests/unittests/conftest.py
@@ -1,9 +1,11 @@
-from dataclasses import dataclass, field
+from __future__ import annotations
+
+from dataclasses import dataclass

 from exo.shared.types.common import NodeId
 from exo.shared.types.memory import Memory
 from exo.shared.types.models import ModelId, ModelMetadata
-from exo.shared.types.tasks import BaseTask, TaskId
+from exo.shared.types.tasks import BaseTask
 from exo.shared.types.worker.instances import (
    BoundInstance,
    Instance,
@@ -19,7 +21,6 @@ from exo.shared.types.worker.shards import PipelineShardMetadata, ShardMetadata
 class FakeRunnerSupervisor:
    bound_instance: BoundInstance
    status: RunnerStatus
-    completed: set[TaskId] = field(default_factory=set)


 class OtherTask(BaseTask):
Author	SHA1	Message	Date
Sami Khan	851cf65bfa	removed animation	2026-01-06 07:07:03 +05:00
Sami Khan	49c66f139f	fix: resolve issue #1025	2026-01-02 10:00:52 +05:00
@@ -1 +0,0 @@
-"""API adapters for different API formats (Claude, OpenAI Responses, etc.)."""