exo/docs/api.md

# EXO API – Technical Reference

This document describes the REST API exposed by the **EXO** service, as implemented in:

`src/exo/master/api.py`

The API is used to manage model instances in the cluster, inspect cluster state, and perform inference using multiple API-compatible interfaces.

Base URL example:

```
http://localhost:52415
```

## 1. General / Meta Endpoints

### Get Master Node ID

**GET** `/node_id`

Returns the identifier of the current master node.

**Response (example):**

```json
{
  "node_id": "node-1234"
}
```

### Get Cluster State

**GET** `/state`

Returns the current state of the cluster, including nodes and active instances.

**Response:**
JSON object describing topology, nodes, and instances.

### Get Events

**GET** `/events`

Returns the list of internal events recorded by the master (mainly for debugging and observability).

**Response:**
Array of event objects.

## 2. Model Instance Management

### Create Instance

**POST** `/instance`

Creates a new model instance in the cluster.

**Request body (example):**

```json
{
  "instance": {
    "model_id": "llama-3.2-1b",
    "placement": { }
  }
}
```

**Response:**
JSON description of the created instance.

### Delete Instance

**DELETE** `/instance/{instance_id}`

Deletes an existing instance by ID.

**Path parameters:**

* `instance_id`: string, ID of the instance to delete

**Response:**
Status / confirmation JSON.

### Get Instance

**GET** `/instance/{instance_id}`

Returns details of a specific instance.

**Path parameters:**

* `instance_id`: string

**Response:**
JSON description of the instance.

### Preview Placements

**GET** `/instance/previews?model_id=...`

Returns possible placement previews for a given model.

**Query parameters:**

* `model_id`: string, required

**Response:**
Array of placement preview objects.

### Compute Placement

**GET** `/instance/placement`

Computes a placement for a potential instance without creating it.

**Query parameters (typical):**

* `model_id`: string
* `sharding`: string or config
* `instance_meta`: JSON-encoded metadata
* `min_nodes`: integer

**Response:**
JSON object describing the proposed placement / instance configuration.

### Place Instance (Dry Operation)

**POST** `/place_instance`

Performs a placement operation for an instance (planning step), without necessarily creating it.

**Request body:**
JSON describing the instance to be placed.

**Response:**
Placement result.

## 3. Models

### List Models

**GET** `/models`
**GET** `/v1/models` (alias)

Returns the list of available models and their metadata.

**Query parameters:**

* `status`: string (optional) - Filter by `downloaded` to show only downloaded models

**Response:**
Array of model descriptors including `is_custom` field for custom HuggingFace models.

### Add Custom Model

**POST** `/models/add`

Add a custom model from HuggingFace hub.

**Request body (example):**

```json
{
  "model_id": "mlx-community/my-custom-model"
}
```

**Response:**
Model descriptor for the added model.

**Security note:**
Models with `trust_remote_code` enabled in their configuration require explicit opt-in (default is false) for security.

### Delete Custom Model

**DELETE** `/models/custom/{model_id}`

Delete a user-added custom model card.

**Path parameters:**

* `model_id`: string, ID of the custom model to delete

**Response:**
Confirmation JSON with deleted model ID.

### Search Models

**GET** `/models/search`

Search HuggingFace Hub for mlx-community models.

**Query parameters:**

* `query`: string (optional) - Search query
* `limit`: integer (default: 20) - Maximum number of results

**Response:**
Array of HuggingFace model search results.

## 4. Inference / Chat Completions

### OpenAI-Compatible Chat Completions

**POST** `/v1/chat/completions`

Executes a chat completion request using an OpenAI-compatible schema. Supports streaming and non-streaming modes.

**Request body (example):**

```json
{
  "model": "llama-3.2-1b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello" }
  ],
  "stream": false
}
```

**Request parameters:**

* `model`: string, required - Model ID to use
* `messages`: array, required - Conversation messages
* `stream`: boolean (default: false) - Enable streaming responses
* `max_tokens`: integer (optional) - Maximum tokens to generate
* `temperature`: float (optional) - Sampling temperature
* `top_p`: float (optional) - Nucleus sampling parameter
* `top_k`: integer (optional) - Top-k sampling parameter
* `stop`: string or array (optional) - Stop sequences
* `seed`: integer (optional) - Random seed for reproducibility
* `enable_thinking`: boolean (optional) - Enable thinking mode for capable models (DeepSeek V3.1, Qwen3, GLM-4.7)
* `tools`: array (optional) - Tool definitions for function calling
* `logprobs`: boolean (optional) - Return log probabilities
* `top_logprobs`: integer (optional) - Number of top log probabilities to return

**Response:**
OpenAI-compatible chat completion response.

**Streaming response format:**
When `stream=true`, returns Server-Sent Events (SSE) with format:

```
data: {"id":"...","object":"chat.completion","created":...,"model":"...","choices":[...]}

data: [DONE]
```

**Non-streaming response includes usage statistics:**

```json
{
  "id": "...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "llama-3.2-1b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 8,
    "total_tokens": 23
  }
}
```

**Cancellation:**
You can cancel an active generation by closing the HTTP connection. The server detects the disconnection and stops processing.

### Claude Messages API

**POST** `/v1/messages`

Executes a chat completion request using the Claude Messages API format. Supports streaming and non-streaming modes.

**Request body (example):**

```json
{
  "model": "llama-3.2-1b",
  "messages": [
    { "role": "user", "content": "Hello" }
  ],
  "max_tokens": 1024,
  "stream": false
}
```

**Streaming response format:**
When `stream=true`, returns Server-Sent Events with Claude-specific event types:

* `message_start` - Message generation started
* `content_block_start` - Content block started
* `content_block_delta` - Incremental content chunk
* `content_block_stop` - Content block completed
* `message_delta` - Message metadata updates
* `message_stop` - Message generation completed

**Response:**
Claude-compatible messages response.

### OpenAI Responses API

**POST** `/v1/responses`

Executes a chat completion request using the OpenAI Responses API format. Supports streaming and non-streaming modes.

**Request body (example):**

```json
{
  "model": "llama-3.2-1b",
  "messages": [
    { "role": "user", "content": "Hello" }
  ],
  "stream": false
}
```

**Streaming response format:**
When `stream=true`, returns Server-Sent Events with response-specific event types:

* `response.created` - Response generation started
* `response.in_progress` - Response is being generated
* `response.output_item.added` - New output item added
* `response.output_item.done` - Output item completed
* `response.done` - Response generation completed

**Response:**
OpenAI Responses API-compatible response.

### Benchmarked Chat Completions

**POST** `/bench/chat/completions`

Same as `/v1/chat/completions`, but also returns performance and generation statistics.

**Request body:**
Same schema as `/v1/chat/completions`.

**Response:**
Chat completion plus benchmarking metrics including:

* `prompt_tps` - Tokens per second during prompt processing
* `generation_tps` - Tokens per second during generation
* `prompt_tokens` - Number of prompt tokens
* `generation_tokens` - Number of generated tokens
* `peak_memory_usage` - Peak memory used during generation

### Cancel Command

**POST** `/v1/cancel/{command_id}`

Cancels an active generation command (text or image). Notifies workers and closes the stream.

**Path parameters:**

* `command_id`: string, ID of the command to cancel

**Response (example):**

```json
{
  "message": "Command cancelled.",
  "command_id": "cmd-abc-123"
}
```

Returns 404 if the command is not found or already completed.

## 5. Ollama API Compatibility

EXO provides Ollama API compatibility for tools like OpenWebUI.

### Ollama Chat

**POST** `/ollama/api/chat`
**POST** `/ollama/api/api/chat` (alias)
**POST** `/ollama/api/v1/chat` (alias)

Execute a chat request using Ollama API format.

**Request body (example):**

```json
{
  "model": "llama-3.2-1b",
  "messages": [
    { "role": "user", "content": "Hello" }
  ],
  "stream": false
}
```

**Response:**
Ollama-compatible chat response.

### Ollama Generate

**POST** `/ollama/api/generate`

Execute a text generation request using Ollama API format.

**Request body (example):**

```json
{
  "model": "llama-3.2-1b",
  "prompt": "Hello",
  "stream": false
}
```

**Response:**
Ollama-compatible generation response.

### Ollama Tags

**GET** `/ollama/api/tags`
**GET** `/ollama/api/api/tags` (alias)
**GET** `/ollama/api/v1/tags` (alias)

Returns list of downloaded models in Ollama tags format.

**Response:**
Array of model tags with metadata.

### Ollama Show

**POST** `/ollama/api/show`

Returns model information in Ollama show format.

**Request body:**

```json
{
  "name": "llama-3.2-1b"
}
```

**Response:**
Model details including modelfile and family.

### Ollama PS

**GET** `/ollama/api/ps`

Returns list of running models (active instances).

**Response:**
Array of active model instances.

### Ollama Version

**GET** `/ollama/api/version`
**HEAD** `/ollama/` (alias)
**HEAD** `/ollama/api/version` (alias)

Returns version information for Ollama API compatibility.

**Response:**

```json
{
  "version": "exo v1.0"
}
```

## 6. Image Generation & Editing

### Image Generation

**POST** `/v1/images/generations`

Executes an image generation request using an OpenAI-compatible schema with additional advanced_params. Supports both streaming and non-streaming modes.

**Request body (example):**

```json
{
  "prompt": "a robot playing chess",
  "model": "exolabs/FLUX.1-dev",
  "n": 1,
  "size": "1024x1024",
  "stream": false,
  "response_format": "b64_json"
}
```

**Request parameters:**

* `prompt`: string, required - Text description of the image
* `model`: string, required - Image model ID
* `n`: integer (default: 1) - Number of images to generate
* `size`: string (default: "auto") - Image dimensions. Supported sizes:
  - `512x512`
  - `768x768`
  - `1024x768`
  - `768x1024`
  - `1024x1024`
  - `1024x1536`
  - `1536x1024`
  - `1024x1365`
  - `1365x1024`
* `stream`: boolean (default: false) - Enable streaming for partial images
* `partial_images`: integer (default: 0) - Number of partial images to stream during generation
* `response_format`: string (default: "b64_json") - Either `url` or `b64_json`
* `quality`: string (default: "medium") - Either `high`, `medium`, or `low`
* `output_format`: string (default: "png") - Either `png`, `jpeg`, or `webp`
* `advanced_params`: object (optional) - Advanced generation parameters

**Advanced Parameters (`advanced_params`):**

| Parameter | Type | Constraints | Description |
|-----------|------|-------------|-------------|
| `seed` | int | >= 0 | Random seed for reproducible generation |
| `num_inference_steps` | int | 1-100 | Number of denoising steps |
| `guidance` | float | 1.0-20.0 | Classifier-free guidance scale |
| `negative_prompt` | string | - | Text describing what to avoid in the image |

**Non-streaming response:**

```json
{
  "created": 1234567890,
  "data": [
    {
      "b64_json": "iVBORw0KGgoAAAANSUhEUgAA...",
      "url": null
    }
  ]
}
```

**Streaming response format:**
When `stream=true` and `partial_images > 0`, returns Server-Sent Events:

```
data: {"type":"partial","image_index":0,"partial_index":1,"total_partials":5,"format":"png","data":{"b64_json":"..."}}

data: {"type":"final","image_index":0,"format":"png","data":{"b64_json":"..."}}

data: [DONE]
```

### Image Editing

**POST** `/v1/images/edits`

Executes an image editing request (img2img) using FLUX.1-Kontext-dev or similar models.

**Request (multipart/form-data):**

* `image`: file, required - Input image to edit
* `prompt`: string, required - Text description of desired changes
* `model`: string, required - Image editing model ID (e.g., `exolabs/FLUX.1-Kontext-dev`)
* `n`: integer (default: 1) - Number of edited images to generate
* `size`: string (optional) - Output image dimensions
* `response_format`: string (default: "b64_json") - Either `url` or `b64_json`
* `input_fidelity`: string (default: "low") - Either `low` or `high` - Controls how closely the output follows the input image
* `stream`: string (default: "false") - Enable streaming
* `partial_images`: string (default: "0") - Number of partial images to stream
* `quality`: string (default: "medium") - Either `high`, `medium`, or `low`
* `output_format`: string (default: "png") - Either `png`, `jpeg`, or `webp`
* `advanced_params`: string (optional) - JSON-encoded advanced parameters

**Response:**
Same format as `/v1/images/generations`.

### Benchmarked Image Generation

**POST** `/bench/images/generations`

Same as `/v1/images/generations`, but also returns generation statistics.

**Request body:**
Same schema as `/v1/images/generations`.

**Response:**
Image generation plus benchmarking metrics including:

* `seconds_per_step` - Average time per denoising step
* `total_generation_time` - Total generation time
* `num_inference_steps` - Number of inference steps used
* `num_images` - Number of images generated
* `image_width` - Output image width
* `image_height` - Output image height
* `peak_memory_usage` - Peak memory used during generation

### Benchmarked Image Editing

**POST** `/bench/images/edits`

Same as `/v1/images/edits`, but also returns generation statistics.

**Request:**
Same schema as `/v1/images/edits`.

**Response:**
Same format as `/bench/images/generations`, including `generation_stats`.

### List Images

**GET** `/images`

List all stored images.

**Response:**
Array of image metadata including URLs and expiration times.

### Get Image

**GET** `/images/{image_id}`

Retrieve a stored image by ID.

**Path parameters:**

* `image_id`: string, ID of the image

**Response:**
Image file with appropriate content type.

## 7. Complete Endpoint Summary

```
# General
GET     /node_id
GET     /state
GET     /events

# Instance Management
POST    /instance
GET     /instance/{instance_id}
DELETE  /instance/{instance_id}
GET     /instance/previews
GET     /instance/placement
POST    /place_instance

# Models
GET     /models
GET     /v1/models
POST    /models/add
DELETE  /models/custom/{model_id}
GET     /models/search

# Text Generation (OpenAI Chat Completions)
POST    /v1/chat/completions
POST    /bench/chat/completions

# Text Generation (Claude Messages API)
POST    /v1/messages

# Text Generation (OpenAI Responses API)
POST    /v1/responses

# Text Generation (Ollama API)
POST    /ollama/api/chat
POST    /ollama/api/api/chat
POST    /ollama/api/v1/chat
POST    /ollama/api/generate
GET     /ollama/api/tags
GET     /ollama/api/api/tags
GET     /ollama/api/v1/tags
POST    /ollama/api/show
GET     /ollama/api/ps
GET     /ollama/api/version
HEAD    /ollama/
HEAD    /ollama/api/version

# Command Control
POST    /v1/cancel/{command_id}

# Image Generation
POST    /v1/images/generations
POST    /bench/images/generations
POST    /v1/images/edits
POST    /bench/images/edits
GET     /images
GET     /images/{image_id}
```

## 8. Notes

### API Compatibility

EXO provides multiple API-compatible interfaces:

* **OpenAI Chat Completions API** - Compatible with OpenAI clients and tools
* **Claude Messages API** - Compatible with Anthropic's Claude API format
* **OpenAI Responses API** - Compatible with OpenAI's Responses API format
* **Ollama API** - Compatible with Ollama and tools like OpenWebUI

Existing OpenAI, Claude, or Ollama clients can be pointed to EXO by changing the base URL.

### Custom Models

You can add custom models from HuggingFace using the `/models/add` endpoint. Custom models are identified by the `is_custom` field in model list responses.

**Security:** Models requiring `trust_remote_code` must be explicitly enabled (default is false) for security. Only enable this if you trust the model's remote code.

### Usage Statistics

Chat completion responses include usage statistics with:

* `prompt_tokens` - Number of tokens in the prompt
* `completion_tokens` - Number of tokens generated
* `total_tokens` - Sum of prompt and completion tokens

### Request Cancellation

You can cancel active requests by:

1. Closing the HTTP connection (for streaming requests)
2. Calling `/v1/cancel/{command_id}` (for any request)

The server detects cancellation and stops processing immediately.

### Instance Placement

The instance placement endpoints allow you to plan and preview cluster allocations before creating instances. This helps optimize resource usage across nodes.

### Observability

The `/events` and `/state` endpoints are primarily intended for operational visibility and debugging.