Files
LocalAI/docs/content/features/cloud-proxy.md
Richard Palethorpe 6a80e23733 feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802)
Add a routing middleware stack and a cloud-proxy backend.

* cloud-proxy: a Go gRPC backend that forwards OpenAI- and
  Anthropic-shaped chat requests to upstream providers, with an
  optional translate mode (OpenAI request -> Anthropic /v1/messages
  -> OpenAI response) and full tool-calling support.

* routing: admission control, content-aware model routing
  (embedding cache + classifier + rerank + Arch-Router score),
  PII detection/redaction (regex + NER) with streaming filter and
  OpenAI/Anthropic adapters, and a per-user/per-key billing recorder
  backed by GORM or in-memory storage.

* middleware: UsageMiddleware records usage via the billing recorder,
  plus admission, route-model, usage-stamp and trace middlewares.

* observability: BackendTrace ring buffer stores full request bodies
  (capped), MITM proxy emits structured trace events, and router
  classifier decisions surface at /api/router/decide.

* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).

* UI: cloud-proxy model-editor fields, classifier system-prompt and
  score-normalization config, and a Traces page rendering request
  bodies.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-25 09:28:27 +02:00

8.0 KiB

+++ title = "Cloud passthrough proxy" weight = 28 toc = true description = "Forward requests to OpenAI, Anthropic, or any compatible provider" tags = ["Proxy", "Cloud", "Routing", "Advanced"] categories = ["Features"] +++

LocalAI can forward chat-completion and Anthropic Messages requests to an external provider instead of running them through the local gRPC backend pipeline. Configure a model with backend: cloud-proxy and a proxy.upstream_url, and LocalAI bypasses templating, MCP injection, and the local model loader entirely — the upstream sees the body the client sent (with only the top-level model field optionally rewritten).

The streaming PII filter still runs over the upstream's SSE stream, so cloud egress remains subject to the same redaction rules a local model would apply.

When to use this

  • Mix local and cloud models in the same LocalAI instance — clients hit one endpoint, LocalAI dispatches per model.
  • Apply LocalAI's auth, usage tracking, and PII redaction to cloud traffic before the body leaves the network.
  • Use the intelligent router to send small or simple prompts to a local model and complex ones to Claude or GPT-4o.

How it works

  1. Request hits LocalAI on /v1/chat/completions (OpenAI-shaped) or /v1/messages (Anthropic-shaped).
  2. The standard auth and routing middleware runs.
  3. Per-model PII redaction runs request-side as it would for any model.
  4. The handler detects the cloud-proxy backend in passthrough mode and loads the cloud-proxy gRPC backend, which owns the outbound HTTP.
  5. The backend POSTs the body to proxy.upstream_url with provider-aware authentication, then streams the SSE response back to core.
  6. The streaming PII filter rewrites per-token text in flight; the upstream's event names and metadata pass through unchanged.

Passthrough mode is wire-format-faithful — it does not translate request shapes between providers. A client posting an OpenAI-shaped body to an Anthropic upstream will get a confused upstream. Use the matching wire format, or switch to translate mode (below).

Configuration

The cloud-proxy backend has one knob — the provider it should authenticate against — and two modes:

proxy.mode What it does When to use
passthrough (default) Forwards the request body verbatim to upstream_url. Client must speak the upstream's wire format. Same wire format on both ends.
translate Backend converts internal proto to the upstream's wire format. Client can speak OpenAI-shaped requests to an Anthropic upstream, etc. Cross-format adaptation.

proxy.provider selects the auth scheme and (in translate mode) the wire format. Supported values: openai, anthropic.

API keys are loaded from either an environment variable (api_key_env) or a file (api_key_file). The key never appears in the config file or the admin UI; pick whichever fits your secret-management setup.

OpenAI passthrough

name: gpt-4o-proxy
backend: cloud-proxy

# When set, replaces the client's "model" field before forwarding.
# Useful when the LocalAI alias differs from the upstream's canonical name.
proxy:
  mode: passthrough
  provider: openai
  upstream_url: https://api.openai.com/v1/chat/completions
  api_key_env: OPENAI_API_KEY
  upstream_model: gpt-4o
  request_timeout_seconds: 120

# PII filtering defaults to ON for cloud-proxy backends. Override by setting
# pii.enabled: false explicitly. Per-pattern action overrides go in
# pii.patterns; see the Middleware admin page or the Middleware feature doc.
pii:
  enabled: true

Then start LocalAI with the API key in the environment:

export OPENAI_API_KEY=sk-...
local-ai run

Clients hit http://localhost:8080/v1/chat/completions with "model": "gpt-4o-proxy" and the request lands on OpenAI's API.

Anthropic passthrough

name: claude-sonnet-proxy
backend: cloud-proxy

proxy:
  mode: passthrough
  provider: anthropic
  upstream_url: https://api.anthropic.com/v1/messages
  api_key_env: ANTHROPIC_API_KEY
  upstream_model: claude-3-5-sonnet-20241022
  request_timeout_seconds: 300

pii:
  enabled: true
  # Block — not just mask — leaked credentials before they reach the upstream.
  patterns:
    - id: api_key_prefix
      action: block

Anthropic clients hit http://localhost:8080/v1/messages with "model": "claude-sonnet-proxy".

Other OpenAI-compatible providers

Most third-party providers (Together, Groq, DeepInfra, OpenRouter, …) speak the OpenAI chat-completions wire format. Use provider: openai with the provider's URL and API key:

name: llama-3-70b-via-together
backend: cloud-proxy

proxy:
  mode: passthrough
  provider: openai
  upstream_url: https://api.together.xyz/v1/chat/completions
  api_key_env: TOGETHER_API_KEY
  upstream_model: meta-llama/Llama-3-70b-chat-hf

Translate mode

In translate mode the cloud-proxy backend converts LocalAI's internal proto to the provider's wire format. This lets a client speak one shape (e.g. OpenAI Chat Completions) against an upstream that expects another (e.g. Anthropic Messages).

name: claude-via-openai-clients
backend: cloud-proxy

proxy:
  mode: translate
  provider: anthropic
  upstream_url: https://api.anthropic.com/v1/messages
  api_key_env: ANTHROPIC_API_KEY
  upstream_model: claude-3-5-sonnet-20241022

Translate mode currently routes only pure-text completions — tool calls, image blocks, and per-request usage tokens are dropped through the internal Predict() signature. Use passthrough mode when your clients need the upstream's full feature set.

Loading secrets from a file

api_key_file is an alternative to api_key_env when your secret manager mounts keys as files (e.g. Kubernetes secrets, Docker secrets, Vault Agent):

proxy:
  api_key_file: /run/secrets/openai_api_key

The file is read at backend load time and trimmed of surrounding whitespace. api_key_env and api_key_file are mutually exclusive.

Combining with the intelligent router

A router model can spread traffic across local and cloud candidates. The score classifier reads the policy descriptions and routes per request:

name: smart-router
router:
  classifier: score
  classifier_model: arch-router-1.5b
  fallback: qwen-3-7b-local
  activation_threshold: 0.40
  policies:
    - label: casual
      description: small talk, greetings, short answers
    - label: code
      description: writing or debugging code in any programming language
    - label: heavy-reasoning
      description: long-form analysis, complex math, multi-step reasoning
  candidates:
    - model: qwen-3-7b-local
      labels: [casual]
    - model: gpt-4o-proxy
      labels: [casual, code]
    - model: claude-sonnet-proxy
      labels: [casual, code, heavy-reasoning]

The router rewrites input.Model to the chosen candidate; per-model PII, ACLs, and the cloud-proxy fork all run against the resolved target.

See [Middleware: PII filtering and intelligent routing]({{< relref "middleware.md" >}}) for the full router and PII-filter reference.

Limitations

  • Passthrough does no wire-shape translation. Use mode: translate (with the constraints documented above) or send requests that match the upstream's format.
  • No output-side PII for non-streaming responses. Streaming responses are filtered in flight; buffered responses pass through verbatim. Request-side PII covers both.
  • No retry or backoff. Transient upstream failures bubble up to the client as 502 Bad Gateway.
  • No request shape validation. If the upstream rejects the body, its error envelope is forwarded to the client unchanged.

Operational notes

  • Cloud-proxy backends load like any other gRPC backend — they consume one process per loaded model and appear in the backend management view, but they hold no GPU memory.
  • Usage stats and the trace log capture cloud-proxy requests like any other request. Token counts come from the upstream's usage field when present.
  • Set request_timeout_seconds defensively — a hung upstream otherwise ties up an HTTP handler until the client disconnects.