LocalAI/docs/content/features/cloud-proxy.md at 3c9b9529c04575d12eaeeeab4e7cef6da83edb5e

mirror of https://github.com/mudler/LocalAI.git synced 2026-05-29 11:07:18 -04:00

Files

Richard Palethorpe 6a80e23733 feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802 )

Add a routing middleware stack and a cloud-proxy backend.

* cloud-proxy: a Go gRPC backend that forwards OpenAI- and
  Anthropic-shaped chat requests to upstream providers, with an
  optional translate mode (OpenAI request -> Anthropic /v1/messages
  -> OpenAI response) and full tool-calling support.

* routing: admission control, content-aware model routing
  (embedding cache + classifier + rerank + Arch-Router score),
  PII detection/redaction (regex + NER) with streaming filter and
  OpenAI/Anthropic adapters, and a per-user/per-key billing recorder
  backed by GORM or in-memory storage.

* middleware: UsageMiddleware records usage via the billing recorder,
  plus admission, route-model, usage-stamp and trace middlewares.

* observability: BackendTrace ring buffer stores full request bodies
  (capped), MITM proxy emits structured trace events, and router
  classifier decisions surface at /api/router/decide.

* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).

* UI: cloud-proxy model-editor fields, classifier system-prompt and
  score-normalization config, and a Traces page rendering request
  bodies.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

Signed-off-by: Richard Palethorpe <io@richiejp.com>

2026-05-25 09:28:27 +02:00

8.0 KiB

Raw Blame History

+++ title = "Cloud passthrough proxy" weight = 28 toc = true description = "Forward requests to OpenAI, Anthropic, or any compatible provider" tags = ["Proxy", "Cloud", "Routing", "Advanced"] categories = ["Features"] +++

LocalAI can forward chat-completion and Anthropic Messages requests to an external provider instead of running them through the local gRPC backend pipeline. Configure a model with backend: cloud-proxy and a proxy.upstream_url, and LocalAI bypasses templating, MCP injection, and the local model loader entirely — the upstream sees the body the client sent (with only the top-level model field optionally rewritten).

The streaming PII filter still runs over the upstream's SSE stream, so cloud egress remains subject to the same redaction rules a local model would apply.

When to use this

Mix local and cloud models in the same LocalAI instance — clients hit one endpoint, LocalAI dispatches per model.
Apply LocalAI's auth, usage tracking, and PII redaction to cloud traffic before the body leaves the network.
Use the intelligent router to send small or simple prompts to a local model and complex ones to Claude or GPT-4o.

How it works

Request hits LocalAI on /v1/chat/completions (OpenAI-shaped) or /v1/messages (Anthropic-shaped).
The standard auth and routing middleware runs.
Per-model PII redaction runs request-side as it would for any model.
The handler detects the cloud-proxy backend in passthrough mode and loads the cloud-proxy gRPC backend, which owns the outbound HTTP.
The backend POSTs the body to proxy.upstream_url with provider-aware authentication, then streams the SSE response back to core.
The streaming PII filter rewrites per-token text in flight; the upstream's event names and metadata pass through unchanged.

Passthrough mode is wire-format-faithful — it does not translate request shapes between providers. A client posting an OpenAI-shaped body to an Anthropic upstream will get a confused upstream. Use the matching wire format, or switch to translate mode (below).

Configuration

The cloud-proxy backend has one knob — the provider it should authenticate against — and two modes:

`proxy.mode`	What it does	When to use
`passthrough` (default)	Forwards the request body verbatim to `upstream_url`. Client must speak the upstream's wire format.	Same wire format on both ends.
`translate`	Backend converts internal proto to the upstream's wire format. Client can speak OpenAI-shaped requests to an Anthropic upstream, etc.	Cross-format adaptation.

proxy.provider selects the auth scheme and (in translate mode) the wire format. Supported values: openai, anthropic.

API keys are loaded from either an environment variable (api_key_env) or a file (api_key_file). The key never appears in the config file or the admin UI; pick whichever fits your secret-management setup.

OpenAI passthrough

name: gpt-4o-proxy
backend: cloud-proxy

# When set, replaces the client's "model" field before forwarding.
# Useful when the LocalAI alias differs from the upstream's canonical name.
proxy:
  mode: passthrough
  provider: openai
  upstream_url: https://api.openai.com/v1/chat/completions
  api_key_env: OPENAI_API_KEY
  upstream_model: gpt-4o
  request_timeout_seconds: 120

# PII filtering defaults to ON for cloud-proxy backends. Override by setting
# pii.enabled: false explicitly. Per-pattern action overrides go in
# pii.patterns; see the Middleware admin page or the Middleware feature doc.
pii:
  enabled: true

Then start LocalAI with the API key in the environment:

export OPENAI_API_KEY=sk-...
local-ai run

Clients hit http://localhost:8080/v1/chat/completions with "model": "gpt-4o-proxy" and the request lands on OpenAI's API.

Anthropic passthrough

name: claude-sonnet-proxy
backend: cloud-proxy

proxy:
  mode: passthrough
  provider: anthropic
  upstream_url: https://api.anthropic.com/v1/messages
  api_key_env: ANTHROPIC_API_KEY
  upstream_model: claude-3-5-sonnet-20241022
  request_timeout_seconds: 300

pii:
  enabled: true
  # Block — not just mask — leaked credentials before they reach the upstream.
  patterns:
    - id: api_key_prefix
      action: block

Anthropic clients hit http://localhost:8080/v1/messages with "model": "claude-sonnet-proxy".

Other OpenAI-compatible providers

Most third-party providers (Together, Groq, DeepInfra, OpenRouter, …) speak the OpenAI chat-completions wire format. Use provider: openai with the provider's URL and API key:

name: llama-3-70b-via-together
backend: cloud-proxy

proxy:
  mode: passthrough
  provider: openai
  upstream_url: https://api.together.xyz/v1/chat/completions
  api_key_env: TOGETHER_API_KEY
  upstream_model: meta-llama/Llama-3-70b-chat-hf

Translate mode

In translate mode the cloud-proxy backend converts LocalAI's internal proto to the provider's wire format. This lets a client speak one shape (e.g. OpenAI Chat Completions) against an upstream that expects another (e.g. Anthropic Messages).

name: claude-via-openai-clients
backend: cloud-proxy

proxy:
  mode: translate
  provider: anthropic
  upstream_url: https://api.anthropic.com/v1/messages
  api_key_env: ANTHROPIC_API_KEY
  upstream_model: claude-3-5-sonnet-20241022

Translate mode currently routes only pure-text completions — tool calls, image blocks, and per-request usage tokens are dropped through the internal Predict() signature. Use passthrough mode when your clients need the upstream's full feature set.

Loading secrets from a file

api_key_file is an alternative to api_key_env when your secret manager mounts keys as files (e.g. Kubernetes secrets, Docker secrets, Vault Agent):

proxy:
  api_key_file: /run/secrets/openai_api_key

The file is read at backend load time and trimmed of surrounding whitespace. api_key_env and api_key_file are mutually exclusive.

Combining with the intelligent router

A router model can spread traffic across local and cloud candidates. The score classifier reads the policy descriptions and routes per request:

name: smart-router
router:
  classifier: score
  classifier_model: arch-router-1.5b
  fallback: qwen-3-7b-local
  activation_threshold: 0.40
  policies:
    - label: casual
      description: small talk, greetings, short answers
    - label: code
      description: writing or debugging code in any programming language
    - label: heavy-reasoning
      description: long-form analysis, complex math, multi-step reasoning
  candidates:
    - model: qwen-3-7b-local
      labels: [casual]
    - model: gpt-4o-proxy
      labels: [casual, code]
    - model: claude-sonnet-proxy
      labels: [casual, code, heavy-reasoning]

The router rewrites input.Model to the chosen candidate; per-model PII, ACLs, and the cloud-proxy fork all run against the resolved target.

See [Middleware: PII filtering and intelligent routing]({{< relref "middleware.md" >}}) for the full router and PII-filter reference.

Limitations

Passthrough does no wire-shape translation. Use mode: translate (with the constraints documented above) or send requests that match the upstream's format.
No output-side PII for non-streaming responses. Streaming responses are filtered in flight; buffered responses pass through verbatim. Request-side PII covers both.
No retry or backoff. Transient upstream failures bubble up to the client as 502 Bad Gateway.
No request shape validation. If the upstream rejects the body, its error envelope is forwarded to the client unchanged.

Operational notes

Cloud-proxy backends load like any other gRPC backend — they consume one process per loaded model and appear in the backend management view, but they hold no GPU memory.
Usage stats and the trace log capture cloud-proxy requests like any other request. Token counts come from the upstream's usage field when present.
Set request_timeout_seconds defensively — a hung upstream otherwise ties up an HTTP handler until the client disconnects.

8.0 KiB Raw Blame History