## Motivation
Updated documentation for v1.0.68 release
## Changes
**docs/api.md:**
- Added documentation for new API endpoints: Claude Messages API
(`/v1/messages`), OpenAI Responses API (`/v1/responses`), and Ollama API
compatibility endpoints
- Documented custom model management endpoints (`POST /models/add`,
`DELETE /models/custom/{model_id}`)
- Added `enable_thinking` parameter documentation for thinking-capable
models (DeepSeek V3.1, Qwen3, GLM-4.7)
- Documented usage statistics in responses (prompt_tokens,
completion_tokens, total_tokens)
- Added streaming event format documentation for all API types
- Updated image generation section with FLUX.1-Kontext-dev support and
new dimensions (1024x1365, 1365x1024)
- Added request cancellation documentation
- Updated complete endpoint summary with all new endpoints
- Added security notes about trust_remote_code being opt-in
**README.md:**
- Updated Features section to highlight multiple API compatibility
options
- Added Environment Variables section documenting all configuration
options (EXO_MODELS_PATH, EXO_OFFLINE, EXO_ENABLE_IMAGE_MODELS,
EXO_LIBP2P_NAMESPACE, EXO_FAST_SYNCH, EXO_TRACING_ENABLED)
- Expanded "Using the API" section with examples for Claude Messages
API, OpenAI Responses API, and Ollama API
- Added custom model loading documentation with security notes
- Updated file locations to include log files and custom model cards
paths
**CONTRIBUTING.md:**
- Added documentation for TOML model cards format and the API adapter
pattern
**docs/architecture.md:**
- Documented the adapter architecture introduced in PR #1167
Closes #1653
---------
Co-authored-by: askmanu[bot] <192355599+askmanu[bot]@users.noreply.github.com>
Co-authored-by: Evan Quiney <evanev7@gmail.com>
EXO API – Technical Reference
This document describes the REST API exposed by the EXO service, as implemented in:
src/exo/master/api.py
The API is used to manage model instances in the cluster, inspect cluster state, and perform inference using multiple API-compatible interfaces.
Base URL example:
http://localhost:52415
1. General / Meta Endpoints
Get Master Node ID
GET /node_id
Returns the identifier of the current master node.
Response (example):
{
"node_id": "node-1234"
}
Get Cluster State
GET /state
Returns the current state of the cluster, including nodes and active instances.
Response: JSON object describing topology, nodes, and instances.
Get Events
GET /events
Returns the list of internal events recorded by the master (mainly for debugging and observability).
Response: Array of event objects.
2. Model Instance Management
Create Instance
POST /instance
Creates a new model instance in the cluster.
Request body (example):
{
"instance": {
"model_id": "llama-3.2-1b",
"placement": { }
}
}
Response: JSON description of the created instance.
Delete Instance
DELETE /instance/{instance_id}
Deletes an existing instance by ID.
Path parameters:
instance_id: string, ID of the instance to delete
Response: Status / confirmation JSON.
Get Instance
GET /instance/{instance_id}
Returns details of a specific instance.
Path parameters:
instance_id: string
Response: JSON description of the instance.
Preview Placements
GET /instance/previews?model_id=...
Returns possible placement previews for a given model.
Query parameters:
model_id: string, required
Response: Array of placement preview objects.
Compute Placement
GET /instance/placement
Computes a placement for a potential instance without creating it.
Query parameters (typical):
model_id: string
sharding: string or config
instance_meta: JSON-encoded metadata
min_nodes: integer
Response: JSON object describing the proposed placement / instance configuration.
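As a sketch of how a client might assemble this query string (a hypothetical helper, not part of EXO; only the parameters listed above are used, and `instance_meta` is JSON-encoded as documented):

```python
import json
from urllib.parse import urlencode

def placement_url(base: str, model_id: str, *, sharding=None,
                  instance_meta=None, min_nodes=None) -> str:
    """Build a GET /instance/placement URL from the documented
    query parameters. instance_meta is JSON-encoded."""
    params = {"model_id": model_id}
    if sharding is not None:
        params["sharding"] = sharding
    if instance_meta is not None:
        params["instance_meta"] = json.dumps(instance_meta)
    if min_nodes is not None:
        params["min_nodes"] = str(min_nodes)
    return f"{base}/instance/placement?{urlencode(params)}"

url = placement_url("http://localhost:52415", "llama-3.2-1b", min_nodes=2)
```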
Place Instance (Dry Run)
POST /place_instance
Performs a placement operation for an instance (planning step), without necessarily creating it.
Request body: JSON describing the instance to be placed.
Response: Placement result.
3. Models
List Models
GET /models
GET /v1/models (alias)
Returns the list of available models and their metadata.
Query parameters:
status: string (optional) - Filter by "downloaded" to show only downloaded models
Response:
Array of model descriptors including is_custom field for custom HuggingFace models.
Add Custom Model
POST /models/add
Add a custom model from HuggingFace hub.
Request body (example):
{
"model_id": "mlx-community/my-custom-model"
}
Response: Model descriptor for the added model.
Security note:
Models with trust_remote_code enabled in their configuration require explicit opt-in (default is false) for security.
Delete Custom Model
DELETE /models/custom/{model_id}
Delete a user-added custom model card.
Path parameters:
model_id: string, ID of the custom model to delete
Response: Confirmation JSON with deleted model ID.
Search Models
GET /models/search
Search HuggingFace Hub for mlx-community models.
Query parameters:
query: string (optional) - Search query
limit: integer (default: 20) - Maximum number of results
Response: Array of HuggingFace model search results.
4. Inference / Chat Completions
OpenAI-Compatible Chat Completions
POST /v1/chat/completions
Executes a chat completion request using an OpenAI-compatible schema. Supports streaming and non-streaming modes.
Request body (example):
{
"model": "llama-3.2-1b",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello" }
],
"stream": false
}
Request parameters:
model: string, required - Model ID to use
messages: array, required - Conversation messages
stream: boolean (default: false) - Enable streaming responses
max_tokens: integer (optional) - Maximum tokens to generate
temperature: float (optional) - Sampling temperature
top_p: float (optional) - Nucleus sampling parameter
top_k: integer (optional) - Top-k sampling parameter
stop: string or array (optional) - Stop sequences
seed: integer (optional) - Random seed for reproducibility
enable_thinking: boolean (optional) - Enable thinking mode for capable models (DeepSeek V3.1, Qwen3, GLM-4.7)
tools: array (optional) - Tool definitions for function calling
logprobs: boolean (optional) - Return log probabilities
top_logprobs: integer (optional) - Number of top log probabilities to return
Response: OpenAI-compatible chat completion response.
Streaming response format:
When stream=true, returns Server-Sent Events (SSE) with format:
data: {"id":"...","object":"chat.completion","created":...,"model":"...","choices":[...]}
data: [DONE]
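To illustrate consuming this stream, here is a minimal parser for individual SSE data lines (a sketch, not part of the EXO codebase; the function name is ours):

```python
import json

def parse_sse_line(line: str):
    """Parse one Server-Sent Events data line from the chat stream.

    Returns the decoded JSON payload, or None for the terminal
    "data: [DONE]" sentinel and for non-data lines.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines, comments, etc.
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    return json.loads(payload)

# Example: pull a streamed chunk apart
chunk = parse_sse_line('data: {"choices":[{"delta":{"content":"Hi"}}]}')
```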
Non-streaming response includes usage statistics:
{
"id": "...",
"object": "chat.completion",
"created": 1234567890,
"model": "llama-3.2-1b",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you?"
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 8,
"total_tokens": 23
}
}
Cancellation: You can cancel an active generation by closing the HTTP connection. The server detects the disconnection and stops processing.
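A minimal Python client sketch for this endpoint using only the standard library (the base URL and helper names are ours; only parameters documented above are used):

```python
import json
import urllib.request

BASE_URL = "http://localhost:52415"  # adjust to your deployment

def build_chat_request(model: str, user_text: str, **params) -> dict:
    """Assemble an OpenAI-compatible chat completion request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    body.update(params)  # e.g. max_tokens, temperature, seed
    return body

def chat(body: dict) -> dict:
    """POST the request to /v1/chat/completions and decode the reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("llama-3.2-1b", "Hello", max_tokens=64)
# response = chat(body)  # requires a running EXO master
```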
Claude Messages API
POST /v1/messages
Executes a chat completion request using the Claude Messages API format. Supports streaming and non-streaming modes.
Request body (example):
{
"model": "llama-3.2-1b",
"messages": [
{ "role": "user", "content": "Hello" }
],
"max_tokens": 1024,
"stream": false
}
Streaming response format:
When stream=true, returns Server-Sent Events with Claude-specific event types:
message_start - Message generation started
content_block_start - Content block started
content_block_delta - Incremental content chunk
content_block_stop - Content block completed
message_delta - Message metadata updates
message_stop - Message generation completed
Response: Claude-compatible messages response.
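To show how these event types compose, here is a small accumulator sketch (not from the EXO codebase; the function name is ours, and the content_block_delta payload shape is assumed to follow Anthropic's `{"delta": {"text": ...}}` convention):

```python
def collect_text(events):
    """Accumulate assistant text from a Claude-style event stream.

    `events` is an iterable of (event_type, payload) pairs. Payload
    shape for content_block_delta is an assumption here.
    """
    parts = []
    for event_type, payload in events:
        if event_type == "content_block_delta":
            parts.append(payload["delta"]["text"])
        elif event_type == "message_stop":
            break  # generation finished
    return "".join(parts)

sample = [
    ("message_start", {}),
    ("content_block_start", {}),
    ("content_block_delta", {"delta": {"text": "Hel"}}),
    ("content_block_delta", {"delta": {"text": "lo"}}),
    ("content_block_stop", {}),
    ("message_stop", {}),
]
```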
OpenAI Responses API
POST /v1/responses
Executes a chat completion request using the OpenAI Responses API format. Supports streaming and non-streaming modes.
Request body (example):
{
"model": "llama-3.2-1b",
"messages": [
{ "role": "user", "content": "Hello" }
],
"stream": false
}
Streaming response format:
When stream=true, returns Server-Sent Events with response-specific event types:
response.created - Response generation started
response.in_progress - Response is being generated
response.output_item.added - New output item added
response.output_item.done - Output item completed
response.done - Response generation completed
Response: OpenAI Responses API-compatible response.
Benchmarked Chat Completions
POST /bench/chat/completions
Same as /v1/chat/completions, but also returns performance and generation statistics.
Request body:
Same schema as /v1/chat/completions.
Response: Chat completion plus benchmarking metrics including:
prompt_tps - Tokens per second during prompt processing
generation_tps - Tokens per second during generation
prompt_tokens - Number of prompt tokens
generation_tokens - Number of generated tokens
peak_memory_usage - Peak memory used during generation
Cancel Command
POST /v1/cancel/{command_id}
Cancels an active generation command (text or image). Notifies workers and closes the stream.
Path parameters:
command_id: string, ID of the command to cancel
Response (example):
{
"message": "Command cancelled.",
"command_id": "cmd-abc-123"
}
Returns 404 if the command is not found or already completed.
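A small client-side sketch for this endpoint (helper names and base URL are ours; a 404 surfaces as `urllib.error.HTTPError`):

```python
import json
import urllib.request

BASE_URL = "http://localhost:52415"  # adjust to your deployment

def cancel_url(base: str, command_id: str) -> str:
    """Build the cancellation endpoint URL for a command."""
    return f"{base}/v1/cancel/{command_id}"

def cancel(command_id: str) -> dict:
    """POST to the cancel endpoint; an HTTP 404 means the command
    is unknown or already completed."""
    req = urllib.request.Request(
        cancel_url(BASE_URL, command_id), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

url = cancel_url(BASE_URL, "cmd-abc-123")
# result = cancel("cmd-abc-123")  # requires a running EXO master
```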
5. Ollama API Compatibility
EXO provides Ollama API compatibility for tools like OpenWebUI.
Ollama Chat
POST /ollama/api/chat
POST /ollama/api/api/chat (alias)
POST /ollama/api/v1/chat (alias)
Execute a chat request using Ollama API format.
Request body (example):
{
"model": "llama-3.2-1b",
"messages": [
{ "role": "user", "content": "Hello" }
],
"stream": false
}
Response: Ollama-compatible chat response.
Ollama Generate
POST /ollama/api/generate
Execute a text generation request using Ollama API format.
Request body (example):
{
"model": "llama-3.2-1b",
"prompt": "Hello",
"stream": false
}
Response: Ollama-compatible generation response.
Ollama Tags
GET /ollama/api/tags
GET /ollama/api/api/tags (alias)
GET /ollama/api/v1/tags (alias)
Returns list of downloaded models in Ollama tags format.
Response: Array of model tags with metadata.
Ollama Show
POST /ollama/api/show
Returns model information in Ollama show format.
Request body:
{
"name": "llama-3.2-1b"
}
Response: Model details including modelfile and family.
Ollama PS
GET /ollama/api/ps
Returns list of running models (active instances).
Response: Array of active model instances.
Ollama Version
GET /ollama/api/version
HEAD /ollama/ (alias)
HEAD /ollama/api/version (alias)
Returns version information for Ollama API compatibility.
Response:
{
"version": "exo v1.0"
}
6. Image Generation & Editing
Image Generation
POST /v1/images/generations
Executes an image generation request using an OpenAI-compatible schema with additional advanced_params. Supports both streaming and non-streaming modes.
Request body (example):
{
"prompt": "a robot playing chess",
"model": "exolabs/FLUX.1-dev",
"n": 1,
"size": "1024x1024",
"stream": false,
"response_format": "b64_json"
}
Request parameters:
prompt: string, required - Text description of the image
model: string, required - Image model ID
n: integer (default: 1) - Number of images to generate
size: string (default: "auto") - Image dimensions. Supported sizes: 512x512, 768x768, 1024x768, 768x1024, 1024x1024, 1024x1536, 1536x1024, 1024x1365, 1365x1024
stream: boolean (default: false) - Enable streaming for partial images
partial_images: integer (default: 0) - Number of partial images to stream during generation
response_format: string (default: "b64_json") - Either "url" or "b64_json"
quality: string (default: "medium") - Either "high", "medium", or "low"
output_format: string (default: "png") - Either "png", "jpeg", or "webp"
advanced_params: object (optional) - Advanced generation parameters
Advanced Parameters (advanced_params):
| Parameter | Type | Constraints | Description |
|---|---|---|---|
| seed | int | >= 0 | Random seed for reproducible generation |
| num_inference_steps | int | 1-100 | Number of denoising steps |
| guidance | float | 1.0-20.0 | Classifier-free guidance scale |
| negative_prompt | string | - | Text describing what to avoid in the image |
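As an illustration, a request builder that enforces the documented constraints before sending (a sketch with hypothetical helper names; the validation mirrors the table above and is client-side only):

```python
def build_image_request(prompt: str, model: str, *, size="1024x1024",
                        advanced=None) -> dict:
    """Assemble a /v1/images/generations request body, checking the
    documented advanced_params constraints before sending."""
    advanced = advanced or {}
    if "seed" in advanced and advanced["seed"] < 0:
        raise ValueError("seed must be >= 0")
    if "num_inference_steps" in advanced and \
            not 1 <= advanced["num_inference_steps"] <= 100:
        raise ValueError("num_inference_steps must be in 1-100")
    if "guidance" in advanced and \
            not 1.0 <= advanced["guidance"] <= 20.0:
        raise ValueError("guidance must be in 1.0-20.0")
    body = {"prompt": prompt, "model": model, "n": 1,
            "size": size, "response_format": "b64_json"}
    if advanced:
        body["advanced_params"] = advanced
    return body
```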
Non-streaming response:
{
"created": 1234567890,
"data": [
{
"b64_json": "iVBORw0KGgoAAAANSUhEUgAA...",
"url": null
}
]
}
Streaming response format:
When stream=true and partial_images > 0, returns Server-Sent Events:
data: {"type":"partial","image_index":0,"partial_index":1,"total_partials":5,"format":"png","data":{"b64_json":"..."}}
data: {"type":"final","image_index":0,"format":"png","data":{"b64_json":"..."}}
data: [DONE]
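To show how a client might consume this stream, here is a sketch that decodes the final image from the documented event lines (function name is ours; partial events are skipped):

```python
import base64
import json

def final_image_bytes(sse_lines):
    """Scan image-generation SSE data lines and decode the final image.

    Returns raw image bytes from the first event with type == "final",
    or None if the stream ends without one.
    """
    for line in sse_lines:
        payload = line.removeprefix("data:").strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        if event.get("type") == "final":
            return base64.b64decode(event["data"]["b64_json"])
    return None
```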
Image Editing
POST /v1/images/edits
Executes an image editing request (img2img) using FLUX.1-Kontext-dev or similar models.
Request (multipart/form-data):
image: file, required - Input image to edit
prompt: string, required - Text description of desired changes
model: string, required - Image editing model ID (e.g., exolabs/FLUX.1-Kontext-dev)
n: integer (default: 1) - Number of edited images to generate
size: string (optional) - Output image dimensions
response_format: string (default: "b64_json") - Either "url" or "b64_json"
input_fidelity: string (default: "low") - Either "low" or "high"; controls how closely the output follows the input image
stream: string (default: "false") - Enable streaming
partial_images: string (default: "0") - Number of partial images to stream
quality: string (default: "medium") - Either "high", "medium", or "low"
output_format: string (default: "png") - Either "png", "jpeg", or "webp"
advanced_params: string (optional) - JSON-encoded advanced parameters
Response:
Same format as /v1/images/generations.
Benchmarked Image Generation
POST /bench/images/generations
Same as /v1/images/generations, but also returns generation statistics.
Request body:
Same schema as /v1/images/generations.
Response: Image generation plus benchmarking metrics including:
seconds_per_step - Average time per denoising step
total_generation_time - Total generation time
num_inference_steps - Number of inference steps used
num_images - Number of images generated
image_width - Output image width
image_height - Output image height
peak_memory_usage - Peak memory used during generation
Benchmarked Image Editing
POST /bench/images/edits
Same as /v1/images/edits, but also returns generation statistics.
Request:
Same schema as /v1/images/edits.
Response:
Same format as /bench/images/generations, including generation_stats.
List Images
GET /images
List all stored images.
Response: Array of image metadata including URLs and expiration times.
Get Image
GET /images/{image_id}
Retrieve a stored image by ID.
Path parameters:
image_id: string, ID of the image
Response: Image file with appropriate content type.
7. Complete Endpoint Summary
# General
GET /node_id
GET /state
GET /events
# Instance Management
POST /instance
GET /instance/{instance_id}
DELETE /instance/{instance_id}
GET /instance/previews
GET /instance/placement
POST /place_instance
# Models
GET /models
GET /v1/models
POST /models/add
DELETE /models/custom/{model_id}
GET /models/search
# Text Generation (OpenAI Chat Completions)
POST /v1/chat/completions
POST /bench/chat/completions
# Text Generation (Claude Messages API)
POST /v1/messages
# Text Generation (OpenAI Responses API)
POST /v1/responses
# Text Generation (Ollama API)
POST /ollama/api/chat
POST /ollama/api/api/chat
POST /ollama/api/v1/chat
POST /ollama/api/generate
GET /ollama/api/tags
GET /ollama/api/api/tags
GET /ollama/api/v1/tags
POST /ollama/api/show
GET /ollama/api/ps
GET /ollama/api/version
HEAD /ollama/
HEAD /ollama/api/version
# Command Control
POST /v1/cancel/{command_id}
# Image Generation
POST /v1/images/generations
POST /bench/images/generations
POST /v1/images/edits
POST /bench/images/edits
GET /images
GET /images/{image_id}
8. Notes
API Compatibility
EXO provides multiple API-compatible interfaces:
- OpenAI Chat Completions API - Compatible with OpenAI clients and tools
- Claude Messages API - Compatible with Anthropic's Claude API format
- OpenAI Responses API - Compatible with OpenAI's Responses API format
- Ollama API - Compatible with Ollama and tools like OpenWebUI
Existing OpenAI, Claude, or Ollama clients can be pointed to EXO by changing the base URL.
Custom Models
You can add custom models from HuggingFace using the /models/add endpoint. Custom models are identified by the is_custom field in model list responses.
Security: Models requiring trust_remote_code must be explicitly enabled (default is false) for security. Only enable this if you trust the model's remote code.
Usage Statistics
Chat completion responses include usage statistics with:
prompt_tokens - Number of tokens in the prompt
completion_tokens - Number of tokens generated
total_tokens - Sum of prompt and completion tokens
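A trivial consistency check a client might apply to the usage object (hypothetical helper, matching the field definitions above):

```python
def check_usage(usage: dict) -> bool:
    """Verify that total_tokens is the sum of prompt and completion tokens."""
    return (usage["total_tokens"]
            == usage["prompt_tokens"] + usage["completion_tokens"])
```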
Request Cancellation
You can cancel active requests by:
- Closing the HTTP connection (for streaming requests)
- Calling /v1/cancel/{command_id} (for any request)
The server detects cancellation and stops processing immediately.
Instance Placement
The instance placement endpoints allow you to plan and preview cluster allocations before creating instances. This helps optimize resource usage across nodes.
Observability
The /events and /state endpoints are primarily intended for operational visibility and debugging.