readme tweaks5 (#954)
---

### Using the API

If you prefer to interact with exo via the API, here is an example of creating an instance of a small model (`mlx-community/Llama-3.2-1B-Instruct-4bit`), sending a chat completion request, and deleting the instance.

---

**1. Preview instance placements**

The `/instance/previews` endpoint returns all valid placements for your model.

```bash
curl "http://localhost:52415/instance/previews?model_id=llama-3.2-1b"
```

Sample response:

```json
{
  "previews": [
    {
      "model_id": "mlx-community/Llama-3.2-1B-Instruct-4bit",
      "sharding": "Pipeline",
      "instance_meta": "MlxRing",
      "instance": {...},
      "memory_delta_by_node": {"local": 729808896},
      "error": null
    }
    // ...possibly more placements...
  ]
}
```
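
Here `memory_delta_by_node` appears to report the additional memory each node would commit to the placement, in bytes; 729808896 bytes is roughly 0.68 GiB.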

Pick a placement you like. To select the first valid one, pipe into `jq`:

```bash
curl "http://localhost:52415/instance/previews?model_id=llama-3.2-1b" | jq -c '.previews[] | select(.error == null) | .instance' | head -n1
```
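
To reuse the chosen placement in step 2 without copying it by hand, you can capture it in a shell variable (a minimal sketch using the same filter as above; `INSTANCE` is just an illustrative name):

```bash
# Capture the first valid placement as compact JSON for reuse in step 2.
INSTANCE=$(curl -s "http://localhost:52415/instance/previews?model_id=llama-3.2-1b" \
  | jq -c '.previews[] | select(.error == null) | .instance' | head -n1)
echo "$INSTANCE"
```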
---

**2. Create a model instance**

Send a POST to `/instance` with the placement you copied from step 1 in the `instance` field (the full payload must match the types in `CreateInstanceParams`):

```bash
curl -X POST http://localhost:52415/instance \
  -H 'Content-Type: application/json' \
  -d '{
    "instance": {...}
  }'
```

Sample response:

```json
{
  "message": "Command received.",
  "command_id": "e9d1a8ab-...."
}
```
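
If you captured the placement in the `INSTANCE` shell variable from step 1, you can build the payload with `jq` instead of pasting the placement by hand (a sketch; curl's `-d @-` reads the request body from stdin):

```bash
# Wrap the captured placement in a {"instance": ...} payload and POST it.
jq -n --argjson inst "$INSTANCE" '{instance: $inst}' \
  | curl -X POST http://localhost:52415/instance \
      -H 'Content-Type: application/json' \
      -d @-
```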
---

**3. Send a chat completion**

Now, make a POST to `/v1/chat/completions` (the same format as OpenAI's API):

```bash
curl -N -X POST http://localhost:52415/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "What is Llama 3.2 1B?"}
    ],
    "stream": true
  }'
```
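
For scripting, a non-streaming call can be more convenient; since the response follows the OpenAI chat completions format, `jq` can pull out just the reply text (a sketch):

```bash
# Same request with streaming off; print only the assistant's reply.
curl -s -X POST http://localhost:52415/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "What is Llama 3.2 1B?"}],
    "stream": false
  }' | jq -r '.choices[0].message.content'
```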
---

**4. Delete the instance**

When you're done, delete the instance by its ID (find it via the `/state` or `/instance` endpoints):

```bash
curl -X DELETE http://localhost:52415/instance/YOUR_INSTANCE_ID
```
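
To find the ID, you can pretty-print the deployment state first (a sketch; `jq .` only formats the output, since the exact shape of the `/state` response isn't shown here):

```bash
# Inspect the cluster state and look up your instance's ID.
curl -s http://localhost:52415/state | jq .
```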

**Other useful API endpoints:**

- List all models: `curl http://localhost:52415/models`
- Inspect instance IDs and deployment state: `curl http://localhost:52415/state`

For further details, see API types and endpoints in [src/exo/master/api.py](src/exo/master/api.py).
---
## Hardware Accelerator Support