mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-03 22:07:58 -04:00
* docs: add 'how LocalAI works' architecture diagram Add a blueprint-style architecture diagram: clients -> small core (API, router, WebUI, agents) -> gRPC -> backend processes pulled on demand as OCI images. Place it on the overview page and replace the stale external architecture image on the reference page. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: add blueprint diagrams across feature, distributed & getting-started docs Add 24 architecture/flow/comparison diagrams (PNG + HTML source) under docs/static/images/diagrams/, wired into their docs pages, from an impact-vs-effort audit of the docs. Broaden the API surface on the overview architecture diagram (OpenAI, Anthropic, ElevenLabs, Ollama, and LocalAI's own API) and move the gRPC boundary label clear of the arrows. Pages: distributed mode (architecture, scheduling, ds4 layer-split), distributed inferencing, MLX, realtime, quantization, MCP, agents, mitm & cloud proxy, middleware, reverse-proxy TLS, VRAM, voice & face recognition, reranker, function calling, fine-tuning (recipe + jobs), diarization, audio transform, quickstart, model resolution. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: add composable-core diagram to README hero Commit the composable-core card (small core + on-demand backend tiles) alongside the other diagrams and reference it from the README hero via a repo-relative path, so it renders on GitHub. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: fix composable-core connectors/badge and federated-vs-worker layout - composable-core: thicken the plug-in connectors so they read clearly, and widen the SEPARATE IMAGE badge so its text no longer overflows the box. - federated-vs-worker: shorten the WHOLE/SPLIT REQUEST pills to fit, and replace the tangled node-to-node activation arrows with a clean fan-out (request split across all sharded nodes), mirroring the federated panel. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
235 lines
8.1 KiB
Markdown
235 lines
8.1 KiB
Markdown
+++
|
|
title = "Cloud passthrough proxy"
|
|
weight = 28
|
|
toc = true
|
|
description = "Forward requests to OpenAI, Anthropic, or any compatible provider"
|
|
tags = ["Proxy", "Cloud", "Routing", "Advanced"]
|
|
categories = ["Features"]
|
|
+++
|
|
|
|

|
|
|
|
LocalAI can forward chat-completion and Anthropic Messages requests to an
|
|
external provider instead of running them through the local gRPC backend
|
|
pipeline. Configure a model with `backend: cloud-proxy` and a `proxy.upstream_url`,
|
|
and LocalAI bypasses templating, MCP injection, and the local model loader
|
|
entirely — the upstream sees the body the client sent (with only the top-level
|
|
`model` field optionally rewritten).
|
|
|
|
The streaming PII filter still runs over the upstream's SSE stream, so cloud
|
|
egress remains subject to the same redaction rules a local model would apply.
|
|
|
|
## When to use this
|
|
|
|
- Mix local and cloud models in the same LocalAI instance — clients hit one
|
|
endpoint, LocalAI dispatches per model.
|
|
- Apply LocalAI's auth, usage tracking, and PII redaction to cloud traffic
|
|
before the body leaves the network.
|
|
- Use the intelligent router to send small or simple prompts to a local model
|
|
and complex ones to Claude or GPT-4o.
|
|
|
|
## How it works
|
|
|
|
1. Request hits LocalAI on `/v1/chat/completions` (OpenAI-shaped) or
|
|
`/v1/messages` (Anthropic-shaped).
|
|
2. The standard auth and routing middleware runs.
|
|
3. Per-model PII redaction runs request-side as it would for any model.
|
|
4. The handler detects the `cloud-proxy` backend in passthrough mode and
|
|
loads the cloud-proxy gRPC backend, which owns the outbound HTTP.
|
|
5. The backend POSTs the body to `proxy.upstream_url` with provider-aware
|
|
authentication, then streams the SSE response back to core.
|
|
6. The streaming PII filter rewrites per-token text in flight; the upstream's
|
|
event names and metadata pass through unchanged.
|
|
|
|
Passthrough mode is **wire-format-faithful** — it does not translate request
|
|
shapes between providers. A client posting an OpenAI-shaped body to an
|
|
Anthropic upstream will get a confused upstream. Use the matching wire format,
|
|
or switch to translate mode (below).
|
|
|
|
## Configuration
|
|
|
|
The cloud-proxy backend has one knob — the provider it should authenticate
|
|
against — and two modes:
|
|
|
|
| `proxy.mode` | What it does | When to use |
|
|
|---|---|---|
|
|
| `passthrough` (default) | Forwards the request body verbatim to `upstream_url`. Client must speak the upstream's wire format. | Same wire format on both ends. |
|
|
| `translate` | Backend converts internal proto to the upstream's wire format. Client can speak OpenAI-shaped requests to an Anthropic upstream, etc. | Cross-format adaptation. |
|
|
|
|
`proxy.provider` selects the auth scheme and (in translate mode) the wire
|
|
format. Supported values: `openai`, `anthropic`.
|
|
|
|
API keys are loaded from either an environment variable (`api_key_env`) or a
|
|
file (`api_key_file`). The key never appears in the config file or the admin
|
|
UI; pick whichever fits your secret-management setup.
|
|
|
|
### OpenAI passthrough
|
|
|
|
```yaml
|
|
name: gpt-4o-proxy
|
|
backend: cloud-proxy
|
|
|
|
# When set, replaces the client's "model" field before forwarding.
|
|
# Useful when the LocalAI alias differs from the upstream's canonical name.
|
|
proxy:
|
|
mode: passthrough
|
|
provider: openai
|
|
upstream_url: https://api.openai.com/v1/chat/completions
|
|
api_key_env: OPENAI_API_KEY
|
|
upstream_model: gpt-4o
|
|
request_timeout_seconds: 120
|
|
|
|
# PII filtering defaults to ON for cloud-proxy backends. Override by setting
|
|
# pii.enabled: false explicitly. Per-pattern action overrides go in
|
|
# pii.patterns; see the Middleware admin page or the Middleware feature doc.
|
|
pii:
|
|
enabled: true
|
|
```
|
|
|
|
Then start LocalAI with the API key in the environment:
|
|
|
|
```bash
|
|
export OPENAI_API_KEY=sk-...
|
|
local-ai run
|
|
```
|
|
|
|
Clients hit `http://localhost:8080/v1/chat/completions` with `"model": "gpt-4o-proxy"`
|
|
and the request lands on OpenAI's API.
|
|
|
|
### Anthropic passthrough
|
|
|
|
```yaml
|
|
name: claude-sonnet-proxy
|
|
backend: cloud-proxy
|
|
|
|
proxy:
|
|
mode: passthrough
|
|
provider: anthropic
|
|
upstream_url: https://api.anthropic.com/v1/messages
|
|
api_key_env: ANTHROPIC_API_KEY
|
|
upstream_model: claude-3-5-sonnet-20241022
|
|
request_timeout_seconds: 300
|
|
|
|
pii:
|
|
enabled: true
|
|
# Block — not just mask — leaked credentials before they reach the upstream.
|
|
patterns:
|
|
- id: api_key_prefix
|
|
action: block
|
|
```
|
|
|
|
Anthropic clients hit `http://localhost:8080/v1/messages` with
|
|
`"model": "claude-sonnet-proxy"`.
|
|
|
|
### Other OpenAI-compatible providers
|
|
|
|
Most third-party providers (Together, Groq, DeepInfra, OpenRouter, …) speak
|
|
the OpenAI chat-completions wire format. Use `provider: openai` with the
|
|
provider's URL and API key:
|
|
|
|
```yaml
|
|
name: llama-3-70b-via-together
|
|
backend: cloud-proxy
|
|
|
|
proxy:
|
|
mode: passthrough
|
|
provider: openai
|
|
upstream_url: https://api.together.xyz/v1/chat/completions
|
|
api_key_env: TOGETHER_API_KEY
|
|
upstream_model: meta-llama/Llama-3-70b-chat-hf
|
|
```
|
|
|
|
### Translate mode
|
|
|
|
In translate mode the cloud-proxy backend converts LocalAI's internal proto
|
|
to the provider's wire format. This lets a client speak one shape (e.g.
|
|
OpenAI Chat Completions) against an upstream that expects another (e.g.
|
|
Anthropic Messages).
|
|
|
|
```yaml
|
|
name: claude-via-openai-clients
|
|
backend: cloud-proxy
|
|
|
|
proxy:
|
|
mode: translate
|
|
provider: anthropic
|
|
upstream_url: https://api.anthropic.com/v1/messages
|
|
api_key_env: ANTHROPIC_API_KEY
|
|
upstream_model: claude-3-5-sonnet-20241022
|
|
```
|
|
|
|
Translate mode currently routes only pure-text completions — tool calls,
|
|
image blocks, and per-request usage tokens are dropped through the
|
|
internal `Predict()` signature. Use passthrough mode when your clients need
|
|
the upstream's full feature set.
|
|
|
|
## Loading secrets from a file
|
|
|
|
`api_key_file` is an alternative to `api_key_env` when your secret manager
|
|
mounts keys as files (e.g. Kubernetes secrets, Docker secrets, Vault Agent):
|
|
|
|
```yaml
|
|
proxy:
|
|
api_key_file: /run/secrets/openai_api_key
|
|
```
|
|
|
|
The file is read at backend load time and trimmed of surrounding whitespace.
|
|
`api_key_env` and `api_key_file` are mutually exclusive.
|
|
|
|
## Combining with the intelligent router
|
|
|
|
A router model can spread traffic across local and cloud candidates. The
|
|
score classifier reads the policy descriptions and routes per request:
|
|
|
|
```yaml
|
|
name: smart-router
|
|
router:
|
|
classifier: score
|
|
classifier_model: arch-router-1.5b
|
|
fallback: qwen-3-7b-local
|
|
activation_threshold: 0.40
|
|
policies:
|
|
- label: casual
|
|
description: small talk, greetings, short answers
|
|
- label: code
|
|
description: writing or debugging code in any programming language
|
|
- label: heavy-reasoning
|
|
description: long-form analysis, complex math, multi-step reasoning
|
|
candidates:
|
|
- model: qwen-3-7b-local
|
|
labels: [casual]
|
|
- model: gpt-4o-proxy
|
|
labels: [casual, code]
|
|
- model: claude-sonnet-proxy
|
|
labels: [casual, code, heavy-reasoning]
|
|
```
|
|
|
|
The router rewrites `input.Model` to the chosen candidate; per-model PII,
|
|
ACLs, and the cloud-proxy fork all run against the resolved target.
|
|
|
|
See [Middleware: PII filtering and intelligent routing]({{< relref "middleware.md" >}})
|
|
for the full router and PII-filter reference.
|
|
|
|
## Limitations
|
|
|
|
- **Passthrough does no wire-shape translation.** Use `mode: translate` (with
|
|
the constraints documented above) or send requests that match the upstream's
|
|
format.
|
|
- **No output-side PII for non-streaming responses.** Streaming responses are
|
|
filtered in flight; buffered responses pass through verbatim. Request-side
|
|
PII covers both.
|
|
- **No retry or backoff.** Transient upstream failures bubble up to the client
|
|
as `502 Bad Gateway`.
|
|
- **No request shape validation.** If the upstream rejects the body, its
|
|
error envelope is forwarded to the client unchanged.
|
|
|
|
## Operational notes
|
|
|
|
- Cloud-proxy backends load like any other gRPC backend — they consume one
|
|
process per loaded model and appear in the backend management view, but
|
|
they hold no GPU memory.
|
|
- Usage stats and the trace log capture cloud-proxy requests like any other
|
|
request. Token counts come from the upstream's `usage` field when present.
|
|
- Set `request_timeout_seconds` defensively — a hung upstream otherwise ties
|
|
up an HTTP handler until the client disconnects.
|