exo connects all your devices into an AI cluster. It lets you run models larger than would fit on a single device and, with day-0 support for RDMA over Thunderbolt, makes models run faster as you add more devices.
Features
- Automatic Device Discovery: Devices running exo automatically discover each other - no manual configuration.
- RDMA over Thunderbolt: exo ships with day-0 support for RDMA over Thunderbolt 5, enabling a 99% reduction in latency between devices.
- Topology-Aware Auto Parallel: exo figures out the best way to split your model across all available devices based on a realtime view of your device topology. It takes into account device resources and network latency/bandwidth between each link.
- Tensor Parallelism: exo supports sharding models, for up to 1.8x speedup on 2 devices and 3.2x speedup on 4 devices.
- MLX Support: exo uses MLX as an inference backend and MLX distributed for distributed communication.
- Multiple API Compatibility: Compatible with OpenAI Chat Completions API, Claude Messages API, OpenAI Responses API, and Ollama API - use your existing tools and clients.
- Custom Model Support: Load custom models from HuggingFace hub to expand the range of available models.
Dashboard
exo includes a built-in dashboard for managing your cluster and chatting with models.
4 × 512GB M3 Ultra Mac Studio running DeepSeek v3.1 (8-bit) and Kimi-K2-Thinking (4-bit)
Benchmarks
Qwen3-235B (8-bit) on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA
Source: Jeff Geerling: 15 TB VRAM on Mac Studio – RDMA over Thunderbolt 5
DeepSeek v3.1 671B (8-bit) on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA
Source: Jeff Geerling: 15 TB VRAM on Mac Studio – RDMA over Thunderbolt 5
Kimi K2 Thinking (native 4-bit) on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA
Source: Jeff Geerling: 15 TB VRAM on Mac Studio – RDMA over Thunderbolt 5
Quick Start
Devices running exo automatically discover each other, without needing any manual configuration. Each device provides an API and a dashboard for interacting with your cluster (runs at http://localhost:52415).
There are two ways to run exo:
Run from Source (macOS)
If you have Nix installed, you can skip most of the steps below and run exo directly:
nix run .#exo
Note: To accept the Cachix binary cache (and avoid the Xcode Metal ToolChain), add to /etc/nix/nix.conf:
trusted-users = root (or your username)
experimental-features = nix-command flakes
Then restart the Nix daemon: sudo launchctl kickstart -k system/org.nixos.nix-daemon
Prerequisites:
- Xcode (provides the Metal ToolChain required for MLX compilation)
- brew (for simple package management on macOS)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- uv (for Python dependency management)
- node (for building the dashboard)
brew install uv node
- rust (to build Rust bindings, nightly for now)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup toolchain install nightly
- macmon (for hardware monitoring on Apple Silicon)
Install the pinned fork revision used by this repo instead of Homebrew macmon. Homebrew macmon 0.6.1 still crashes on Apple M5.
cargo install --git https://github.com/vladkens/macmon \
  --rev a1cd06b6cc0d5e61db24fd8832e74cd992097a7d \
  macmon \
  --force
Clone the repo, build the dashboard, and run exo:
# Clone exo
git clone https://github.com/exo-explore/exo
# Build dashboard
cd exo/dashboard && npm install && npm run build && cd ..
# Run exo
uv run exo
This starts the exo dashboard and API at http://localhost:52415/
To enable RDMA on macOS 26.2 or later, see the section "Enabling RDMA on macOS".
Run from Source (Linux)
Prerequisites:
- uv (for Python dependency management)
- node (for building the dashboard) - version 18 or higher
- rust (to build Rust bindings, nightly for now)
Installation methods:
Option 1: Using system package manager (Ubuntu/Debian example):
# Install Node.js and npm
sudo apt update
sudo apt install nodejs npm
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install Rust (using rustup)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup toolchain install nightly
Option 2: Using Homebrew on Linux (if preferred):
# Install Homebrew on Linux
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install dependencies
brew install uv node
# Install Rust (using rustup)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup toolchain install nightly
Note: The macmon package is macOS-only and not required for Linux.
Clone the repo, build the dashboard, and run exo:
# Clone exo
git clone https://github.com/exo-explore/exo
# Build dashboard
cd exo/dashboard && npm install && npm run build && cd ..
# Run exo
uv run exo
This starts the exo dashboard and API at http://localhost:52415/
Important note for Linux users: Currently, exo runs on CPU on Linux. GPU support for Linux platforms is under development. If you'd like to see support for your specific Linux hardware, please search for existing feature requests or create a new one.
Configuration Options:
- --no-worker: Run exo without the worker component. Useful for coordinator-only nodes that handle networking and orchestration but don't execute inference tasks. This is helpful for machines without sufficient GPU resources but with good network connectivity.
uv run exo --no-worker
File Locations (Linux):
exo follows the XDG Base Directory Specification on Linux:
- Configuration files: ~/.config/exo/ (or $XDG_CONFIG_HOME/exo/)
- Data files: ~/.local/share/exo/ (or $XDG_DATA_HOME/exo/)
- Cache files: ~/.cache/exo/ (or $XDG_CACHE_HOME/exo/)
- Log files: ~/.cache/exo/exo_log/ (with automatic log rotation)
- Custom model cards: ~/.local/share/exo/custom_model_cards/
You can override these locations by setting the corresponding XDG environment variables.
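The fallback rules above can be sketched in Python. This is a minimal illustration of XDG-style resolution, not exo's actual implementation: the helper name `exo_dirs` is hypothetical, and placing the log directory under `$XDG_CACHE_HOME` is an assumption based on the default path listed above.

```python
import os
from pathlib import Path


def exo_dirs(env=None, home=None):
    """Resolve exo's Linux directories using the XDG fallbacks listed above.

    Hypothetical helper for illustration; exo's own code may differ.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    def xdg(var, default):
        # Per the XDG spec, an unset or empty variable falls back to the default.
        value = env.get(var)
        return Path(value) if value else home / default

    config = xdg("XDG_CONFIG_HOME", ".config") / "exo"
    data = xdg("XDG_DATA_HOME", ".local/share") / "exo"
    cache = xdg("XDG_CACHE_HOME", ".cache") / "exo"
    return {
        "config": config,
        "data": data,
        "cache": cache,
        # Assumption: logs live under the cache directory.
        "logs": cache / "exo_log",
        "custom_model_cards": data / "custom_model_cards",
    }


if __name__ == "__main__":
    for name, path in exo_dirs().items():
        print(f"{name}: {path}")
```

Passing `env` and `home` explicitly makes it easy to check what a given override would resolve to without touching the real environment.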
macOS App
exo ships a macOS app that runs in the background on your Mac.
The macOS app requires macOS Tahoe 26.2 or later.
Download the latest build here: EXO-latest.dmg.
The app will ask for permission to modify system settings and install a new Network profile. Improvements to this are being worked on.
Custom Namespace for Cluster Isolation:
The macOS app includes a custom namespace feature that allows you to isolate your exo cluster from others on the same network. This is configured through the EXO_LIBP2P_NAMESPACE setting:
- Use cases:
  - Running multiple separate exo clusters on the same network
  - Isolating development/testing clusters from production clusters
  - Preventing accidental cluster joining
- Configuration: Access this setting in the app's Advanced settings (or set the EXO_LIBP2P_NAMESPACE environment variable when running from source).
The namespace is logged on startup for debugging purposes.
Uninstalling the macOS App
The recommended way to uninstall is through the app itself: click the menu bar icon → Advanced → Uninstall. This cleanly removes all system components.
If you've already deleted the app, you can run the standalone uninstaller script:
sudo ./app/EXO/uninstall-exo.sh
This removes:
- Network setup LaunchDaemon
- Network configuration script
- Log files
- The "exo" network location
Note: You'll need to manually remove EXO from Login Items in System Settings → General → Login Items.
Enabling RDMA on macOS
RDMA is a new capability added to macOS 26.2. It works on any Mac with Thunderbolt 5 (M4 Pro Mac Mini, M4 Max Mac Studio, M4 Max MacBook Pro, M3 Ultra Mac Studio).
Please refer to the caveats for immediate troubleshooting.
To enable RDMA on macOS, follow these steps:
- Shut down your Mac.
- Hold down the power button for 10 seconds until the boot menu appears.
- Select "Options" to enter Recovery mode.
- When the Recovery UI appears, open the Terminal from the Utilities menu.
- In the Terminal, type the following and press Enter:
rdma_ctl enable
- Reboot your Mac.
After that, RDMA will be enabled in macOS and exo will take care of the rest.
Important Caveats
- Devices that wish to be part of an RDMA cluster must be connected to all other devices in the cluster.
- The cables must support TB5.
- On a Mac Studio, you cannot use the Thunderbolt 5 port next to the Ethernet port.
- If running from source, please use the script found at tmp/set_rdma_network_config.sh, which will disable Thunderbolt Bridge and set DHCP on each RDMA port.
- RDMA ports may be unable to discover each other on different versions of macOS. Please ensure that OS versions match exactly (even beta version numbers) on all devices.
Environment Variables
exo supports several environment variables for configuration:
| Variable | Description | Default |
|---|---|---|
| EXO_DEFAULT_MODELS_DIR | Default directory for model downloads and caches. Always first in the writable dirs list. | ~/.local/share/exo/models (Linux) or ~/.exo/models (macOS) |
| EXO_MODELS_DIRS | Colon-separated additional writable directories for model downloads. Checked in order after the default; first with enough free space is used. | None |
| EXO_MODELS_READ_ONLY_DIRS | Colon-separated read-only directories to search for pre-downloaded models (e.g., NFS mounts, shared storage). Models here cannot be deleted. | None |
| EXO_OFFLINE | Run without internet connection (uses only local models) | false |
| EXO_ENABLE_IMAGE_MODELS | Enable image model support | false |
| EXO_LIBP2P_NAMESPACE | Custom namespace for cluster isolation | None |
| EXO_FAST_SYNCH | Control MLX_METAL_FAST_SYNCH behavior (for JACCL backend) | Auto |
| EXO_TRACING_ENABLED | Enable distributed tracing for performance analysis | false |
Example usage:
# Use pre-downloaded models from NFS mount (read-only)
EXO_MODELS_READ_ONLY_DIRS=/mnt/nfs/models:/opt/ai-models uv run exo
# Download models to an external SSD (falls back to default dir if full)
EXO_MODELS_DIRS=/Volumes/ExternalSSD/exo-models uv run exo
# Run in offline mode
EXO_OFFLINE=true uv run exo
# Enable image models
EXO_ENABLE_IMAGE_MODELS=true uv run exo
# Use custom namespace for cluster isolation
EXO_LIBP2P_NAMESPACE=my-dev-cluster uv run exo
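The "colon-separated, first directory with enough free space" rule can be sketched as follows. This is an illustration under stated assumptions rather than exo's code: the function names are hypothetical, the order in which the default and additional directories are combined is left to the caller, and skipping unreadable directories is an assumption.

```python
import os
import shutil


def parse_dirs(value):
    """Split a colon-separated variable such as EXO_MODELS_DIRS into paths."""
    return [p for p in (value or "").split(":") if p]


def _disk_free(path):
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def first_with_space(candidates, needed_bytes, free_bytes=_disk_free):
    """Return the first candidate with at least `needed_bytes` free, else None.

    `free_bytes` is injectable so the selection logic can be tested
    without touching real disks.
    """
    for path in candidates:
        try:
            if free_bytes(path) >= needed_bytes:
                return path
        except OSError:
            # Assumption: a missing or unreadable directory is skipped.
            continue
    return None


if __name__ == "__main__":
    extra = parse_dirs(os.environ.get("EXO_MODELS_DIRS"))
    print(first_with_space(extra, needed_bytes=5 * 1024**3))
```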
Using the API
exo provides multiple API-compatible interfaces for maximum compatibility with existing tools:
- OpenAI Chat Completions API - Compatible with OpenAI clients
- Claude Messages API - Compatible with Anthropic's Claude format
- OpenAI Responses API - Compatible with OpenAI's Responses format
- Ollama API - Compatible with Ollama and tools like OpenWebUI
If you prefer to interact with exo via the API, here is an example creating an instance of a small model (mlx-community/Llama-3.2-1B-Instruct-4bit), sending a chat completions request and deleting the instance.
1. Preview instance placements
The /instance/previews endpoint will preview all valid placements for your model.
curl "http://localhost:52415/instance/previews?model_id=llama-3.2-1b"
Sample response:
{
"previews": [
{
"model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"sharding": "Pipeline",
"instance_meta": "MlxRing",
"instance": {...},
"memory_delta_by_node": {"local": 729808896},
"error": null
}
// ...possibly more placements...
]
}
Pick a placement from the response.
To pick the first one, pipe into jq:
curl "http://localhost:52415/instance/previews?model_id=llama-3.2-1b" | jq -c '.previews[] | select(.error == null) | .instance' | head -n1
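For clients that would rather not shell out to jq, the same selection can be sketched in Python using only the standard library. The function names are hypothetical; the response shape is taken from the sample response above.

```python
import json
import urllib.parse
import urllib.request


def first_valid_instance(previews_response):
    """Return the `instance` of the first preview whose `error` is null.

    Mirrors the jq filter: .previews[] | select(.error == null) | .instance
    """
    for preview in previews_response.get("previews", []):
        if preview.get("error") is None:
            return preview["instance"]
    return None


def fetch_previews(model_id, base_url="http://localhost:52415"):
    """GET /instance/previews for a model and decode the JSON body."""
    query = urllib.parse.urlencode({"model_id": model_id})
    with urllib.request.urlopen(f"{base_url}/instance/previews?{query}") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running exo node on localhost.
    instance = first_valid_instance(fetch_previews("llama-3.2-1b"))
    print(json.dumps(instance))
```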
2. Create a model instance
Send a POST to /instance with your desired placement in the instance field (the full payload must match types as in CreateInstanceParams), which you can copy from step 1:
curl -X POST http://localhost:52415/instance \
-H 'Content-Type: application/json' \
-d '{
"instance": {...}
}'
Sample response:
{
"message": "Command received.",
"command_id": "e9d1a8ab-...."
}
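A Python equivalent of the curl call might look like the sketch below. It only wraps the placement in the `instance` field as described above; the helper name is hypothetical, and any further fields that CreateInstanceParams may require are not covered here.

```python
import json
import urllib.request


def build_create_instance_request(instance, base_url="http://localhost:52415"):
    """Build (but do not send) a POST /instance request for a chosen placement."""
    body = json.dumps({"instance": instance}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/instance",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # `placement` stands in for an instance object copied from step 1.
    placement = {}
    request = build_create_instance_request(placement)
    with urllib.request.urlopen(request) as resp:  # needs a running exo node
        print(json.load(resp))
```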
3. Send a chat completion
Now, make a POST to /v1/chat/completions (the same format as OpenAI's API):
curl -N -X POST http://localhost:52415/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [
{"role": "user", "content": "What is Llama 3.2 1B?"}
],
"stream": true
}'
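With "stream": true, a client has to reassemble the reply from streamed chunks. The sketch below assumes the stream follows OpenAI's server-sent-events convention (lines of the form `data: {...}`, ending with `data: [DONE]`, with text in `choices[].delta.content`); that is an assumption based on the stated OpenAI compatibility, not something verified against exo. The function name is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style `data: {...}` stream lines.

    Accepts an iterable of str or bytes lines, e.g. an HTTP response body.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    demo = [
        'data: {"choices":[{"delta":{"content":"Hel"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(demo)))
```

Because the function takes any iterable of lines, it can be fed the response object returned by `urllib.request.urlopen` directly.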
4. Delete the instance
When you're done, delete the instance by its ID (find it via /state or /instance endpoints):
curl -X DELETE http://localhost:52415/instance/YOUR_INSTANCE_ID
Claude Messages API Compatibility
Use the Claude Messages API format with the /v1/messages endpoint:
curl -N -X POST http://localhost:52415/v1/messages \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [
{"role": "user", "content": "Hello"}
],
"max_tokens": 1024,
"stream": true
}'
OpenAI Responses API Compatibility
Use the OpenAI Responses API format with the /v1/responses endpoint:
curl -N -X POST http://localhost:52415/v1/responses \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [
{"role": "user", "content": "Hello"}
],
"stream": true
}'
Ollama API Compatibility
exo supports Ollama API endpoints for compatibility with tools like OpenWebUI:
# Ollama chat
curl -X POST http://localhost:52415/ollama/api/chat \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
"messages": [
{"role": "user", "content": "Hello"}
],
"stream": false
}'
# List models (Ollama format)
curl http://localhost:52415/ollama/api/tags
Custom Model Loading from HuggingFace
You can add custom models from the HuggingFace hub:
curl -X POST http://localhost:52415/models/add \
-H 'Content-Type: application/json' \
-d '{
"model_id": "mlx-community/my-custom-model"
}'
Security Note:
Custom models requiring trust_remote_code in their configuration must be explicitly enabled (default is false) for security. Only enable this if you trust the code shipped with the model, since it will be executed on your machine. Models are fetched from HuggingFace and stored locally as custom model cards.
Other useful API endpoints:
- List all models:
curl http://localhost:52415/models
- List downloaded models only:
curl "http://localhost:52415/models?status=downloaded"
- Search HuggingFace:
curl "http://localhost:52415/models/search?query=llama&limit=10"
- Inspect instance IDs and deployment state:
curl http://localhost:52415/state
For further details, see:
- API documentation in docs/api.md.
- API types and endpoints in src/exo/master/api.py.
Benchmarking
The exo-bench tool measures model prefill and token generation speed across different placement configurations. This helps you optimize model performance and validate improvements.
Prerequisites:
- Nodes should be running with uv run exo before benchmarking
- The tool uses the /bench/chat/completions endpoint
Basic usage:
uv run bench/exo_bench.py \
--model Llama-3.2-1B-Instruct-4bit \
--pp 128,256,512 \
--tg 128,256
Key parameters:
- --model: Model to benchmark (short ID or HuggingFace ID)
- --pp: Prompt size hints (comma-separated integers)
- --tg: Generation lengths (comma-separated integers)
- --max-nodes: Limit placements to N nodes (default: 4)
- --instance-meta: Filter by ring, jaccl, or both (default: both)
- --sharding: Filter by pipeline, tensor, or both (default: both)
- --repeat: Number of repetitions per configuration (default: 1)
- --warmup: Warmup runs per placement (default: 0)
- --json-out: Output file for results (default: bench/results.json)
Example with filters:
uv run bench/exo_bench.py \
--model Llama-3.2-1B-Instruct-4bit \
--pp 128,512 \
--tg 128 \
--max-nodes 2 \
--sharding tensor \
--repeat 3 \
--json-out my-results.json
The tool outputs performance metrics including prompt tokens per second (prompt_tps), generation tokens per second (generation_tps), and peak memory usage for each configuration.
Composable benchmarks (CLI)
For benchmarks that need an eco-managed cluster and a stable JSON result format, exo ships a CLI under bench/cli/. The CLI handles cluster lifecycle, instance placement, model-metadata resolution (HuggingFace), and result capture; benchmark logic lives in bench/lib/ so each new benchmark is a small library module + a CLI subcommand.
Run the prompt-TPS / decode-TPS vs context-size sweep:
The defaults assume a multi-node, Thunderbolt-connected cluster with tensor parallelism + JACCL — the typical exo benchmarking setup:
# Defaults: --sharding Tensor --comm MlxJaccl --thunderbolt a2a, with
# memory/disk minimums auto-derived from the HF model size. eco picks
# `--nodes` hosts from its inventory that form a TB clique and satisfy
# those constraints.
uv run python -m bench.cli context-scaling \
--model mlx-community/Qwen3-30B-A3B-4bit --nodes 2 --num-steps 32
# Pin to specific hosts (defaults still apply for sharding/comm/topology)
uv run python -m bench.cli context-scaling --hosts s4,s9 \
--model mlx-community/Qwen3-30B-A3B-4bit --num-steps 16
# Single-node smoke test: explicit single-node placement overrides
uv run python -m bench.cli context-scaling --hosts s4 \
--model mlx-community/Llama-3.2-1B-Instruct-4bit --num-steps 4 \
--sharding Pipeline --comm MlxRing --thunderbolt none
# Override the auto-derived ramp / cold controls
uv run python -m bench.cli context-scaling --hosts s4,s9 --model X \
--pp-step 4096 --num-steps 32 --cold-controls 8192,32768,65536,131072
# Custom output dir + tags
uv run python -m bench.cli context-scaling --hosts s4,s9 --model X \
--output-dir bench/results/2026-05-10/ --tag operator=$USER --tag run=full
# Run from a TOML config (CLI flags override values from the file)
uv run python -m bench.cli context-scaling \
--config bench/configs/context_scaling.example.toml
Shared flags (every benchmark subcommand has these):
- --model: HuggingFace model id (required)
- --config <path>.toml: load run parameters from a TOML file
- --sharding {Pipeline,Tensor} (default Tensor): sharding mode
- --comm {MlxRing,MlxJaccl} (default MlxJaccl): inter-node comm mode
- --min-nodes N (default 1): minimum nodes for the placement
- --hosts s4,s9: pin to specific hosts; bypasses constraint search
- --nodes N (default 1): number of cluster hosts to reserve when --hosts is unset (distinct from --min-nodes, which controls the model's instance placement)
- --thunderbolt {a2a,ring,none} (default a2a): required Thunderbolt topology
- --chip "M3 Ultra": required chip (substring match; comment to allow any)
- --min-memory-gb, --max-memory-gb, --min-disk-gb, --max-disk-gb: host RAM / disk constraints. The minimums are auto-derived from the HF model size (×1.30 + 1 GiB for memory, ×1.10 + 1 GiB for disk) when not supplied; explicit values always win.
- --evict-downloads (default on): auto-evict existing models smallest-first on disk-full; pass --no-evict-downloads to keep
- --cleanup-instance (default on): delete the placed instance after exit; pass --no-cleanup-instance to leave it running for debugging
- --output-dir bench/results: base directory for JSON results (subcommands add their own subfolder)
- --tag key=value: append to metadata.tags (repeatable)
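To make the auto-derived minimums concrete, here is a small arithmetic sketch using the factors quoted above (×1.30 + 1 GiB for memory, ×1.10 + 1 GiB for disk). The function name is hypothetical, and how the CLI rounds the result or converts between GB and GiB is not specified here, so the sketch leaves values unrounded in GiB.

```python
GIB = 1024 ** 3


def derived_minimums_gib(model_bytes):
    """Return (min_memory, min_disk) in GiB for a model of `model_bytes` bytes.

    Uses the factors quoted in the flag description; rounding is left out
    because the CLI's rounding rule is not documented here.
    """
    size_gib = model_bytes / GIB
    min_memory = size_gib * 1.30 + 1
    min_disk = size_gib * 1.10 + 1
    return min_memory, min_disk


if __name__ == "__main__":
    memory, disk = derived_minimums_gib(100 * GIB)
    print(f"memory >= {memory:.1f} GiB, disk >= {disk:.1f} GiB")
```

For a 100 GiB model this works out to roughly 131 GiB of memory and 111 GiB of disk.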
Context-scaling-specific flags:
- --num-steps N: number of equally-spaced ramp points (default 32)
- --pp-step Δ: explicit Δ in tokens (overrides auto-derivation from max_position_embeddings)
- --fraction-of-max F: when Δ is auto-derived, use F × max_context as the upper bound
- --tg: tokens generated per step (default 64)
- --warmup: warmup requests at pp=Δ (default 1)
- --cold-controls auto (4 evenly-spaced points across the ramp) or --cold-controls 8192,32768,… (explicit pp values). Default: no cold controls.
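One plausible reading of how these flags interact is sketched below. It assumes the ramp is Δ, 2Δ, …, NΔ and that an auto-derived Δ is F × max_context divided by the number of steps, rounded down; the benchmark's actual derivation is documented in bench/lib/context_scaling.py and may differ. The function name and the default of 1.0 for the fraction are hypothetical.

```python
def ramp_points(num_steps, max_context=None, fraction_of_max=1.0, pp_step=None):
    """Return the prompt sizes for an equally spaced ramp.

    An explicit `pp_step` wins; otherwise the step is derived so that the
    last point does not exceed fraction_of_max * max_context (assumption).
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if pp_step is None:
        if max_context is None:
            raise ValueError("max_context is required when pp_step is not given")
        pp_step = int(fraction_of_max * max_context) // num_steps
    if pp_step <= 0:
        raise ValueError("derived step is not positive")
    return [pp_step * i for i in range(1, num_steps + 1)]


if __name__ == "__main__":
    print(ramp_points(8, max_context=131072, fraction_of_max=0.5))
```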
Output: each run writes bench/results/<benchmark>/<run_id>.json plus a latest.json symlink. The JSON contains metadata (exo SHA, hostname, platform, user tags), the full cluster snapshot at run start, the resolved + derived params, per-step rows, cold-control rows, and derived summaries (t_cum_seconds, control_gaps).
Multi-run campaigns — bench campaign runs a list of bench invocations from a single TOML file. Each [[runs]] entry is its own cluster deploy + bench + teardown, with a shared [defaults] table for DRY config:
# bench/configs/llama-family-smoke.toml
[defaults]
nodes = 4
num_steps = 8
fraction_of_max = 0.5
[[runs]]
subcommand = "context-scaling"
model = "mlx-community/Llama-3.2-3B-Instruct-4bit"
[runs.tags]
model_short = "llama-3.2-3b-4bit"
[[runs]]
subcommand = "context-scaling"
model = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
[runs.tags]
model_short = "llama-3.1-8b-4bit"
[plot]
label_tag = "model_short"
uv run python -m bench.cli campaign bench/configs/llama-family-smoke.toml
After all runs finish, an optional [plot] table triggers a comparison plot per benchmark group (one PNG per benchmark type with ≥2 runs).
Plotting — bench plot renders any results JSON to a 2-panel PNG (prompt_tps + generation_tps vs pp_tokens, cold controls overlaid as 'x' markers):
# Plot the most recent run next to its JSON
uv run python -m bench.cli plot bench/results/context_scaling/latest.json
# Compare multiple runs (one line per run; legend label = the chosen tag)
uv run python -m bench.cli plot run_a.json run_b.json --label-tag operator
# Custom output path + title
uv run python -m bench.cli plot run.json --output /tmp/scaling.png --title "30B 4-node sweep"
The benchmark type is detected from each JSON's metadata.benchmark, so the same plot command will work for future benchmarks once their renderer is registered in bench/lib/plotting.py.
Methodology for the context-scaling benchmark is documented in detail in bench/lib/context_scaling.py's module docstring and in bench/METHODOLOGY.md.
Hardware Accelerator Support
On macOS, exo uses the GPU. On Linux, exo currently runs on CPU. We are working on extending hardware accelerator support. If you'd like support for a new hardware platform, please search for an existing feature request and add a thumbs up so we know what hardware is important to the community.
Contributing
See CONTRIBUTING.md for guidelines on how to contribute to exo.

