+++
disableToc = false
title = "(experimental) MLX Distributed Inference"
weight = 18
url = '/features/mlx-distributed/'
+++

MLX distributed inference allows you to split large language models across multiple Apple Silicon Macs (or other devices) for joint inference. Unlike federation (which distributes whole requests), MLX distributed splits a single model's layers across machines so they all participate in every forward pass.

## How It Works

MLX distributed uses **pipeline parallelism** via the Ring backend: each node holds a slice of the model's layers. During inference, activations flow from rank 0 through each subsequent rank in a pipeline. The last rank gathers the final output.

For high-bandwidth setups (e.g., Thunderbolt-connected Macs), **JACCL** (tensor parallelism via RDMA) is also supported, where each rank holds all layers but with sharded weights.

## Prerequisites

- Two or more machines with MLX installed (Apple Silicon recommended)
- Network connectivity between all nodes (TCP for Ring, RDMA/Thunderbolt for JACCL)
- The same model accessible on all nodes (e.g., from the Hugging Face cache)

## Quick Start with P2P

The simplest way to use MLX distributed is with LocalAI's P2P auto-discovery.

### 1. Start LocalAI with P2P

```bash
docker run -ti --net host \
  --name local-ai \
  localai/localai:latest-metal-darwin-arm64 run --p2p
```

This generates a network token. Copy it for the next step.

### 2. Start MLX Workers

On each additional Mac:

```bash
docker run -ti --net host \
  -e TOKEN="" \
  --name local-ai-mlx-worker \
  localai/localai:latest-metal-darwin-arm64 worker p2p-mlx
```

Workers auto-register on the P2P network. The LocalAI server discovers them and generates a hostfile for MLX distributed.

### 3. Use the Model

Load any MLX-compatible model.
The `mlx-distributed` backend will automatically shard it across all available ranks:

```yaml
name: llama-distributed
backend: mlx-distributed
parameters:
  model: mlx-community/Llama-3.2-1B-Instruct-4bit
```

## Model Configuration

The `mlx-distributed` backend is started automatically by LocalAI like any other backend. You configure distributed inference through the model YAML file using the `options` field:

### Ring Backend (TCP)

```yaml
name: llama-distributed
backend: mlx-distributed
parameters:
  model: mlx-community/Llama-3.2-1B-Instruct-4bit
options:
- "hostfile:/path/to/hosts.json"
- "distributed_backend:ring"
```

The **hostfile** is a JSON array where entry `i` is the `"ip:port"` that **rank `i` listens on** for ring communication. All ranks must use the same hostfile so they know how to reach each other.

**Example:** Two Macs — Mac A (`192.168.1.10`) and Mac B (`192.168.1.11`):

```json
["192.168.1.10:5555", "192.168.1.11:5555"]
```

- Entry 0 (`192.168.1.10:5555`) — the address rank 0 (Mac A) listens on for ring communication
- Entry 1 (`192.168.1.11:5555`) — the address rank 1 (Mac B) listens on for ring communication

Port 5555 is arbitrary — use any available port, but it must be open in your firewall.

### JACCL Backend (RDMA/Thunderbolt)

```yaml
name: llama-distributed
backend: mlx-distributed
parameters:
  model: mlx-community/Llama-3.2-1B-Instruct-4bit
options:
- "hostfile:/path/to/devices.json"
- "distributed_backend:jaccl"
```

The **device matrix** is a JSON 2D array describing the RDMA device name between each pair of ranks. The diagonal is `null` (a rank doesn't talk to itself):

```json
[
  [null, "rdma_thunderbolt0"],
  ["rdma_thunderbolt0", null]
]
```

JACCL requires a **coordinator** — a TCP service that helps all ranks establish RDMA connections. Rank 0 (the LocalAI machine) is always the coordinator. Workers are told the coordinator address via their `--coordinator` CLI flag (see [Starting Workers](#jaccl-workers) below).
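To make the two hostfile shapes concrete, here is a small Python sketch that checks the structural rules described above. It is not part of LocalAI — `validate_ring_hostfile` and `validate_jaccl_matrix` are hypothetical helpers for illustration only:

```python
import json

def validate_ring_hostfile(text):
    """Ring: a JSON array where entry i is the "ip:port" rank i listens on."""
    hosts = json.loads(text)
    assert isinstance(hosts, list) and hosts, "expected a non-empty JSON array"
    for i, addr in enumerate(hosts):
        ip, _, port = addr.rpartition(":")
        assert ip and port.isdigit(), f"entry {i} is not 'ip:port': {addr!r}"
    return len(hosts)  # number of ranks

def validate_jaccl_matrix(text):
    """JACCL: an NxN matrix of RDMA device names, with null on the diagonal."""
    matrix = json.loads(text)
    n = len(matrix)
    for i, row in enumerate(matrix):
        assert len(row) == n, f"row {i} has {len(row)} entries, expected {n}"
        assert row[i] is None, f"diagonal entry [{i}][{i}] must be null"
        for j, dev in enumerate(row):
            if i != j:
                assert isinstance(dev, str), f"[{i}][{j}] must name an RDMA device"
    return n

print(validate_ring_hostfile('["192.168.1.10:5555", "192.168.1.11:5555"]'))
# -> 2
print(validate_jaccl_matrix('[[null, "rdma_thunderbolt0"], ["rdma_thunderbolt0", null]]'))
# -> 2
```

Running a check like this before distributing a hostfile can catch the most common mistakes (a missing port, or a device matrix whose diagonal isn't `null`) before any rank times out.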
### Without a hostfile (single-node)

If no `hostfile` option is set and no `MLX_DISTRIBUTED_HOSTFILE` environment variable exists, the backend runs as a regular single-node MLX backend. This is useful for testing or when you don't need distributed inference.

### Available Options

| Option | Description |
|--------|-------------|
| `hostfile` | Path to the hostfile JSON. Ring: array of `"ip:port"`. JACCL: device matrix. |
| `distributed_backend` | `ring` (default) or `jaccl` |
| `trust_remote_code` | Allow `trust_remote_code` for the tokenizer |
| `max_tokens` | Override the default maximum number of generated tokens |
| `temperature` / `temp` | Sampling temperature |
| `top_p` | Top-p sampling |

These can also be set via environment variables (`MLX_DISTRIBUTED_HOSTFILE`, `MLX_DISTRIBUTED_BACKEND`), which are used as fallbacks when the model options don't specify them.

## Starting Workers

LocalAI starts the rank 0 process (gRPC server) automatically when the model is loaded. But you still need to start **worker processes** (ranks 1, 2, ...) on the other machines. These workers participate in every forward pass but don't serve any API — they wait for commands from rank 0.

### Ring Workers

On each worker machine, start a worker with the same hostfile:

```bash
local-ai worker mlx-distributed --hostfile hosts.json --rank 1
```

The `--rank` must match the worker's position in the hostfile. For example, if `hosts.json` is `["192.168.1.10:5555", "192.168.1.11:5555", "192.168.1.12:5555"]`, then:

- Rank 0: started automatically by LocalAI on `192.168.1.10`
- Rank 1: `local-ai worker mlx-distributed --hostfile hosts.json --rank 1` on `192.168.1.11`
- Rank 2: `local-ai worker mlx-distributed --hostfile hosts.json --rank 2` on `192.168.1.12`

### JACCL Workers

```bash
local-ai worker mlx-distributed \
  --hostfile devices.json \
  --rank 1 \
  --backend jaccl \
  --coordinator 192.168.1.10:5555
```

The `--coordinator` address is the IP of the machine running LocalAI (rank 0) with any available port.
Rank 0 binds the coordinator service there; workers connect to it to establish RDMA connections.

### Worker Startup Order

Start workers **before** loading the model in LocalAI. When LocalAI sends the LoadModel request, rank 0 initializes `mx.distributed`, which tries to connect to all ranks listed in the hostfile. If workers aren't running yet, it will time out.

## Advanced: Manual Rank 0

For advanced use cases, you can also run rank 0 manually as an external gRPC backend instead of letting LocalAI start it automatically:

```bash
# On Mac A: start rank 0 manually
local-ai worker mlx-distributed --hostfile hosts.json --rank 0 --addr 192.168.1.10:50051

# On Mac B: start rank 1
local-ai worker mlx-distributed --hostfile hosts.json --rank 1

# On any machine: start LocalAI pointing at rank 0
local-ai run --external-grpc-backends "mlx-distributed:192.168.1.10:50051"
```

Then use a model config with `backend: mlx-distributed` (there is no need for `hostfile` in `options`, since rank 0 already has it from its CLI args).

## CLI Reference

### `worker mlx-distributed`

Starts a worker or a manual rank 0 process.

| Flag | Env | Default | Description |
|------|-----|---------|-------------|
| `--hostfile` | `MLX_DISTRIBUTED_HOSTFILE` | *(required)* | Path to the hostfile JSON. Ring: array of `"ip:port"` where entry `i` is rank `i`'s listen address. JACCL: device matrix of RDMA device names. |
| `--rank` | `MLX_RANK` | *(required)* | Rank of this process (0 = gRPC server + ring participant, >0 = worker only) |
| `--backend` | `MLX_DISTRIBUTED_BACKEND` | `ring` | `ring` (TCP pipeline parallelism) or `jaccl` (RDMA tensor parallelism) |
| `--addr` | `MLX_DISTRIBUTED_ADDR` | `localhost:50051` | gRPC API listen address (rank 0 only, for LocalAI or external access) |
| `--coordinator` | `MLX_JACCL_COORDINATOR` | | JACCL coordinator `ip:port` — rank 0's address for RDMA setup (all ranks must use the same value) |

### `worker p2p-mlx`

P2P mode — auto-discovers peers and generates the hostfile.
| Flag | Env | Default | Description |
|------|-----|---------|-------------|
| `--token` | `TOKEN` | *(required)* | P2P network token |
| `--mlx-listen-port` | `MLX_LISTEN_PORT` | `5555` | Port for MLX communication |
| `--mlx-backend` | `MLX_DISTRIBUTED_BACKEND` | `ring` | Backend type: `ring` or `jaccl` |

## Troubleshooting

- **All ranks download the model independently.** Each node auto-downloads from Hugging Face on first use via `mlx_lm.load()`. On rank 0 (started by LocalAI), models are downloaded to LocalAI's model directory (`HF_HOME` is set automatically). On workers, models go to the default HF cache (`~/.cache/huggingface/hub`) unless you set `HF_HOME` yourself.
- **Timeout errors:** If ranks can't connect, check firewall rules. The Ring backend uses TCP on the ports listed in the hostfile. Start workers before loading the model.
- **Rank assignment:** In P2P mode, rank 0 is always the LocalAI server. Worker ranks are assigned by sorting node IDs.
- **Performance:** Pipeline parallelism adds latency proportional to the number of ranks. For best results, use the fewest ranks needed to fit your model in memory.

## Acknowledgements

The MLX distributed auto-parallel sharding implementation is based on [exo](https://github.com/exo-explore/exo).
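The P2P rank-assignment rule noted under Troubleshooting (the LocalAI server is always rank 0; worker ranks follow sorted node IDs) can be sketched as a few lines of Python. This is a hypothetical illustration of the rule, not LocalAI's actual implementation, and the node IDs below are made up:

```python
def assign_ranks(server_id, worker_ids):
    """Rank 0 is always the LocalAI server; workers are ranked by sorted node ID."""
    ranks = {server_id: 0}
    for rank, node_id in enumerate(sorted(worker_ids), start=1):
        ranks[node_id] = rank
    return ranks

# Hypothetical node IDs for illustration:
print(assign_ranks("node-server", ["node-zz", "node-aa"]))
# -> {'node-server': 0, 'node-aa': 1, 'node-zz': 2}
```

The practical consequence is that ranks are deterministic for a given set of peers: adding or removing a worker can shift the ranks of the others, so restart all workers after the topology changes.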