mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-24 16:49:06 -04:00
pii_default_detectors was applied to the live config only by a live POST /api/settings (ApplyRuntimeSettings); neither the startup loader nor the config file watcher read it back. So after a restart the persisted default detectors were dropped, and the cloud-proxy MITM listener (which resolves each intercept host's detectors once at start via ResolvePIIPolicy) came up with an empty set and forwarded intercepted traffic unredacted, even though the MITM model had pii.enabled:true and the defaults were on disk. Request-side default redaction broke the same way.

- startup.go: loadRuntimeSettingsFromFile now applies pii_default_detectors, before startMITMIfConfigured, with env > file precedence.
- config_file_watcher.go: apply pii_default_detectors on live file edits, matching the existing env-guard pattern used for the other fields.
- settings endpoint: rebuild the MITM listener when pii_default_detectors changes (its per-host detector map is frozen at listener start), not only on a mitm_listen change, so toggling a default detector takes effect on cloud-proxy traffic immediately.
- new LOCALAI_PII_DEFAULT_DETECTORS env var / CLI flag (WithPIIDefaultDetectors) so the default detector set can be pinned at boot for immutable deployments.

Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
643 lines
29 KiB
Markdown
+++
title = "Middleware: PII filtering and intelligent routing"
weight = 27
toc = true
description = "Per-model PII redaction and policy-based request routing"
tags = ["Routing", "Privacy", "PII", "Middleware", "Advanced"]
categories = ["Features"]
+++

LocalAI ships a request-middleware layer that sits between the HTTP API and
the backend dispatcher. Two subsystems share that layer because they share
the same lifecycle hook: **PII filtering** scans the request body before it
reaches a backend, and the **intelligent router** rewrites `input.Model` so
a single client-facing model name fans out across multiple downstream
targets.

Both are inspected and configured from the same admin page
(`/app/middleware`), backed by the same REST surface (`/api/middleware/*`,
`/api/pii/*`, `/api/router/*`) and the same MCP tools.

## Request lifecycle

```
client ── auth ── route-model ── per-model PII ── backend ── client
                       │               │
                       │               └─── event log
                       └─── decision log
```

The router runs first (it picks the target model so per-model PII has
something to gate on), per-model PII runs next (gated by the resolved
config), and the backend executes. Filtering is **request-side only**:
the request body is scanned and rewritten before forwarding; the response
is not touched (NER over a streamed response is left as a follow-up). Each
subsystem writes to its own admin-visible log: `/api/router/decisions` for
routing, `/api/pii/events` for redaction and block actions.

---

## PII filtering

PII redaction is **NER-based and runs request-side (input)**. It is
**off by default**, flipping to **on for any `cloud-proxy` backend**
because that traffic crosses the network to a third-party provider. Pick a
[default detector](#instance-wide-default-detector) so those models are actually
scanned. Explicit `pii.enabled` in a model's YAML always wins over the
backend default.

Filtering runs on every text-accepting endpoint that has an adapter wired:
`/v1/chat/completions` and `/v1/messages` (chat), `/v1/completions`,
`/v1/embeddings`, `/v1/edits`, and the Ollama `/api/chat`, `/api/generate`
and `/api/embed` endpoints, plus the [MITM proxy]({{< relref "mitm-proxy.md" >}})
request body. Image, audio (TTS/STT), video, rerank, and the realtime
WebSocket are not filtered yet (different prompt-PII semantics; realtime is
not HTTP middleware).

A request's messages are scanned **as one document** (joined in order), so
the NER detector keeps conversational context: whether `4421` is a PIN or
`jdoe_42` is a username is usually decided by the question asked in the
*previous* message, and a bidirectional encoder only sees that context when
the messages share a forward pass. Detected spans are mapped back to the
individual message they fall in, so redaction still rewrites each message
field in place and events carry message-local offsets.

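The span mapping described above can be sketched as follows. This is a minimal illustration, not LocalAI's implementation: the separator `SEP` and the helper names `join_messages` and `to_message_local` are assumptions made for the example.

```python
from bisect import bisect_right

SEP = "\n\n"  # assumed joiner; the separator LocalAI actually uses is not stated here


def join_messages(messages):
    """Return the joined document and each message's start offset within it."""
    starts, pos = [], 0
    for text in messages:
        starts.append(pos)
        pos += len(text) + len(SEP)
    return SEP.join(messages), starts


def to_message_local(span, messages, starts):
    """Map a (start, end) document span to (message_index, local_start, local_end)."""
    start, end = span
    idx = bisect_right(starts, start) - 1  # last message starting at or before `start`
    local_start = start - starts[idx]
    local_end = min(end - starts[idx], len(messages[idx]))
    return idx, local_start, local_end


messages = ["What is your PIN?", "It is 4421, thanks."]
doc, starts = join_messages(messages)
pin_at = doc.index("4421")
location = to_message_local((pin_at, pin_at + 4), messages, starts)
```

The detector sees `doc` in one pass, while `location` tells the redactor which message field to rewrite and at which local offsets.
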
> The earlier regex pattern tier (`pii.patterns`, the built-in pattern
> catalogue, `--pii-config`, the `/api/pii/patterns|test|decide` endpoints)
> and response/streaming-side redaction have been **removed**. Detection is
> now driven entirely by `token_classify` detector models: neural NER models
> and the in-process [pattern detectors](#pattern-detector-tier). Legacy keys
> no-op with a startup warning.

### Detector models

A **detector** is a `token_classify` model (e.g. an `openai-privacy-filter`
GGUF) that carries the detection *policy* in a top-level `pii_detection:`
block, defined once, on the model itself:

```yaml
name: privacy-filter-multilingual
backend: privacy-filter
embeddings: true          # TOKEN_CLS pooling
known_usecases:
  - token_classify
pii_detection:
  min_score: 0.5          # drop detections below this confidence
  default_action: mask    # applied to any detected group with no entry
  entity_actions:         # which PII to block vs mask vs allow-log
    PASSWORD: block
    CREDITCARD: block
    EMAIL: mask
```

`mask` rewrites the matched span to `[REDACTED:ner:<GROUP>]` in the request
body before forwarding. `block` returns HTTP 400 (`error.type=pii_blocked`)
without forwarding. `allow` detects and logs (a PIIEvent is still recorded)
but leaves the text unchanged. The entity-group names are whatever the model
emits (the privacy-filter family uses uppercase names like `EMAIL`,
`PASSWORD`, `CREDITCARD`).

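To make the three actions concrete, here is a minimal sketch of how a single detector's policy could be applied to a string. The `Detection` type, the `PIIBlocked` exception (standing in for the HTTP 400 response) and `apply_policy` are hypothetical names for this example; only the policy keys and the mask format come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    group: str    # entity group emitted by the model, e.g. "EMAIL"
    start: int
    end: int
    score: float


class PIIBlocked(Exception):
    """Stands in for the HTTP 400 `pii_blocked` response."""


POLICY = {
    "min_score": 0.5,
    "default_action": "mask",
    "entity_actions": {"PASSWORD": "block", "CREDITCARD": "block", "EMAIL": "mask"},
}


def apply_policy(text, detections, policy=POLICY):
    # Drop low-confidence detections, then look up each group's action.
    kept = [d for d in detections if d.score >= policy["min_score"]]
    actions = [
        (d, policy["entity_actions"].get(d.group, policy["default_action"]))
        for d in kept
    ]
    blocked = [d.group for d, action in actions if action == "block"]
    if blocked:
        raise PIIBlocked(blocked)
    # Rewrite right-to-left so earlier offsets stay valid; "allow" leaves text as-is.
    for d, action in sorted(actions, key=lambda pair: pair[0].start, reverse=True):
        if action == "mask":
            text = text[:d.start] + f"[REDACTED:ner:{d.group}]" + text[d.end:]
    return text


masked = apply_policy("mail jane@acme.io now", [Detection("EMAIL", 5, 17, 0.9)])
```

Whether the real filter compares with `>=` or `>` against `min_score` is not specified here; the sketch assumes `>=`.
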
### Pattern detector tier

NER is the wrong tool for high-entropy, highly-regular **secrets**: API keys,
tokens, private-key blocks. A trained NER model has no "API key" class, so it
fragments a key into the nearest categories it *does* know and can leave the
secret part exposed. Those secrets are exactly what a regex catches cheaply.

A **pattern detector** is a detector model (`backend: pattern`) that matches
secrets with a **restricted regex subset** compiled to Go's RE2 engine
(linear-time, no backtracking, no ReDoS). It runs entirely in-process: no model
download, no backend, zero VRAM. Install the gallery's **`secret-filter`** for a
ready-made set, or define your own:

```yaml
name: secret-filter
backend: pattern
known_usecases: [token_classify]   # so it appears in the detector picker
pii_detection:
  default_action: block            # a leaked credential shouldn't leave
  builtins:                        # built-in catalogue (enable by name)
    - anthropic_api_key
    - openai_api_key
    - github_token
    - aws_access_key
    - private_key_block
  patterns:                        # operator-defined, restricted subset
    - name: INTERNAL_TOKEN
      match: "tok-[A-Za-z0-9]{32,64}"
      action: block                # optional per-pattern override
      min_len: 36                  # optional length floor
```

A match is reported under its group (built-in group name, or the pattern
`name`), so `entity_actions` / `default_action` apply exactly as for NER.

**The restricted grammar** (validated at load; an invalid pattern is rejected,
not silently ignored):

- Allowed: literals, character classes `[…]` and `\w \d \s`, alternation,
  anchors `^ $ \b`, and quantifiers `? * + {m,n}`.
- Rejected: `.` (any-char), capturing groups, and `{n,m}` bounds over 4096.
- **Required anchor**: every pattern must contain a fixed literal run of at
  least 3 characters (e.g. `sk-ant-`, `ghp_`, `AKIA`). This admits real key
  shapes but rejects open-ended ones: an email or a bare `\w+` has no such
  anchor and belongs to the [NER tier](#detector-models).

Use both tiers together: reference an NER detector *and* a pattern detector in a
model's `pii.detectors` (or as instance defaults); their hits union, and a
`block` from either rejects the request.

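As a rough illustration of what the `INTERNAL_TOKEN` pattern above matches, the sketch below runs it with Python's `re` module. Python's engine is a backtracking one and only stands in for RE2 here; the `scan` helper is hypothetical, and reading `min_len` as a floor on the matched length is an assumption.

```python
import re

# Mirrors the operator-defined pattern from the YAML above.
PATTERNS = [
    {"name": "INTERNAL_TOKEN", "match": r"tok-[A-Za-z0-9]{32,64}",
     "action": "block", "min_len": 36},
]


def scan(text, patterns=PATTERNS):
    """Return one hit per match, reported under the pattern's name."""
    hits = []
    for p in patterns:
        for m in re.finditer(p["match"], text):
            if len(m.group()) >= p.get("min_len", 0):  # assumed meaning of min_len
                hits.append({"entity_type": p["name"], "start": m.start(),
                             "end": m.end(), "action": p["action"]})
    return hits


token = "tok-" + "A1b2" * 8          # 4 + 32 = 36 characters
hits = scan("key=" + token)
```

The literal prefix `tok-` is what satisfies the required-anchor rule; a pattern without such a run would be rejected at load.
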
### Consuming models

Any model opts in by enabling PII and referencing one or more detectors;
there is no per-consumer policy:

```yaml
name: claude-strict
backend: cloud-proxy
proxy:
  mode: passthrough
  provider: anthropic
  upstream_url: https://api.anthropic.com/v1/messages
  api_key_env: ANTHROPIC_API_KEY
pii:
  enabled: true    # default-on for cloud-proxy; explicit for audit
  detectors:
    - privacy-filter-multilingual
```

Multiple detectors **union** their detections; overlapping spans resolve to
the strongest action (`block` > `mask` > `allow`). A configured detector
that can't be loaded **fails the request closed** (HTTP 503,
`error.type=pii_ner_unavailable`) rather than silently skipping the check.
The same NER path runs on the [MITM proxy]({{< relref "mitm-proxy.md" >}})
request body for intercepted hosts. Response/output redaction is out of
scope for now.

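The union rule can be sketched like this. It is one plausible reading, assuming overlapping spans are merged into a single span that carries the strongest action; `resolve_overlaps` is a hypothetical helper, not a LocalAI function.

```python
RANK = {"allow": 0, "mask": 1, "block": 2}


def resolve_overlaps(spans):
    """spans: (start, end, action) tuples from all detectors, in any order.

    Overlapping spans are merged and keep the strongest action.
    """
    merged = []
    for start, end, action in sorted(spans):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_action = merged[-1]
            strongest = max(prev_action, action, key=RANK.__getitem__)
            merged[-1] = (prev_start, max(prev_end, end), strongest)
        else:
            merged.append((start, end, action))
    return merged


resolved = resolve_overlaps([(5, 12, "block"), (0, 10, "mask"), (20, 25, "allow")])
```
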
### Instance-wide default detector

The **Detector models** table on the Middleware → Filtering page lists every
`token_classify` detector model (neural NER models and in-process pattern
matchers alike) and exposes a per-row **Default** toggle. Toggling a detector
on adds it to the instance-wide default detector set: one or more models
applied to any PII-enabled model that names none of its own `pii.detectors`.
The set is persisted through `POST /api/settings` and read live, so a change
takes effect on the next request without a restart. A default that names a
model no longer loaded still appears (marked *not loaded*) so it can be
toggled off.

The default set can also be supplied out-of-band with the
`LOCALAI_PII_DEFAULT_DETECTORS` environment variable (comma-separated model
names, e.g. `privacy-filter-nemotron,secret-filter`). When set it takes
precedence over the value persisted via the UI (env > file), which is the
right behaviour for immutable container deployments that pin filtering policy
at boot rather than via the admin UI.

This is what makes `cloud-proxy` / MITM redaction work out of the box: those
backends default to PII-enabled but ship no detector list, so without a
default detector the filter runs with nothing to scan. Set one here and
cloud-proxy traffic is scanned with no per-model config.

Resolution precedence (the single decision point is `ResolvePIIPolicy`,
shared by the chat middleware and the MITM listener so both agree):

1. An explicit `pii.enabled` on the model wins, whether `true` or `false`.
2. Otherwise PII is on if the backend defaults it on (`cloud-proxy`).
3. Detectors are the model's own `pii.detectors`; if it lists none, the
   instance-wide default detector(s) are used.

A model that resolves enabled but ends up with no detector at all (a
cloud-proxy model with no model detectors and no instance default) scans
nothing; set a default detector to close that gap.

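The precedence rules above can be sketched in a few lines. This is an illustration of the documented order only; the real `ResolvePIIPolicy` is Go code whose signature is not shown here, and the dictionary-based model shape and both function names below are assumptions.

```python
import os

BACKEND_DEFAULT_ON = {"cloud-proxy"}


def default_detectors(persisted, env=os.environ):
    """Instance-wide defaults: the environment variable wins over the persisted value."""
    raw = env.get("LOCALAI_PII_DEFAULT_DETECTORS", "")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return names or list(persisted)


def resolve_pii_policy(model, instance_defaults):
    """Return (enabled, detectors) for a model config given as a plain dict."""
    pii = model.get("pii", {})
    if "enabled" in pii:                                    # rule 1: explicit wins
        enabled = bool(pii["enabled"])
    else:                                                   # rule 2: backend default
        enabled = model.get("backend") in BACKEND_DEFAULT_ON
    if not enabled:
        return False, []
    # Rule 3: the model's own detectors, else the instance-wide defaults.
    return True, list(pii.get("detectors") or instance_defaults)


defaults = default_detectors(["secret-filter"], env={})
policy = resolve_pii_policy({"backend": "cloud-proxy"}, defaults)
```

The last case in the text shows up as `(True, [])`: enabled, but with nothing to scan.
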
### Admin page

The `/app/middleware` page (admin role only) has four tabs: **Filtering**,
**Routing**, **MITM Proxy** (see the [MITM doc]({{< relref "mitm-proxy.md" >}})),
and **Events**. The Filtering tab has two tables:

- A **Detector models** table listing every `token_classify` filter model,
  with the per-row Default toggle described above, an edit link to each
  detector's config, and an *Add detector model* button.
- A per-model table listing only the models PII can actually apply to: chat /
  completion / embeddings / edit consumers and cloud-proxy models, not
  VAD/STT/image models or the detector models themselves.

Each row of the per-model table reports:

- The **effective** `enabled` state as an inline **toggle**. Flipping it writes
  an explicit `pii.enabled` to that model's YAML (a server-side deep-merge that
  preserves `pii.detectors` and every other field), so a cloud-proxy model shown
  on by backend default can be turned off, and vice versa.
- The resolved detector(s), with a *(default)* marker when they come from the
  instance-wide default rather than the model's YAML.
- Why it is on (`YAML` / `backend default`).
- The recent event count.

Detection *policy* (entity→action, min score) is still edited on each detector
model's config (Models → edit → PII), not globally.

### Analyze / redact API

The same detection pipeline is also exposed as a standalone service, so a
client can scan or sanitise a string **without** routing a full chat request
through it (the inline path above). Two endpoints, both requiring a normal API
key (the `pii_filter` feature, not admin):

- `POST /api/pii/analyze`: detect only. Returns the matched entity spans
  (`entity_type`, `source` `ner`|`pattern`, `start`/`end`, `score`, `action`)
  and a `blocked` flag, **without modifying the text**.
- `POST /api/pii/redact`: apply the configured policy. Returns `redacted_text`
  (with masked spans replaced by `[REDACTED:<id>]`) and `masked`; when a `block`
  action fires it returns `400` with `type: pii_blocked` and the offending
  entities, never a redacted body.

Both take the same request: `text` plus a detector selection, either explicit
detector model names in `detectors`, or a consuming `model` whose **effective**
policy is used: the model's own `pii.detectors`, else the
[instance-wide default detectors](#instance-wide-default-detector), exactly as
the inline filter resolves them. A `model` with PII disabled (or enabled but
with no detector anywhere) is a `400`: the inline filter would scan nothing
for it, and the API says so rather than implying a clean scan. The detection
policy lives on the detector models exactly as for the inline filter. The raw
matched value is never returned (an admin may pass `reveal: true` to include
the audit `hash_prefix`).

`text` is scanned as a single document. To reproduce the inline filter's
conversation-context behaviour for multi-message content, join the messages
with blank lines into one `text`; NER detection quality depends on that
context (a bare `4421` is nothing; after "what are the last four digits of
your card?" it is a PIN).

```bash
# Redact with an explicit pattern/NER detector.
# The Authorization header uses double quotes so the shell expands $API_KEY.
curl -sX POST http://localhost:8080/api/pii/redact \
  -H "Authorization: Bearer $API_KEY" -H 'Content-Type: application/json' \
  -d '{"text":"reach me at jane@acme.io","detectors":["my-ner-model"]}'
# => {"redacted_text":"reach me at [REDACTED:ner:EMAIL]","masked":true,...}

# Analyze using a consuming model's configured detectors
curl -sX POST http://localhost:8080/api/pii/analyze \
  -H "Authorization: Bearer $API_KEY" -H 'Content-Type: application/json' \
  -d '{"text":"sk-ant-api03-…","model":"gpt-4"}'
# => {"entities":[{"entity_type":"ANTHROPIC_KEY","source":"pattern",...,"action":"block"}],"blocked":true}
```

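For clients that are not shell scripts, the same redact call, including the advice to join multi-message content with blank lines, might look like the sketch below. It uses only the Python standard library; `build_redact_request` and `redact` are hypothetical helper names, and the base URL is the one used in the curl examples.

```python
import json
import urllib.request


def build_redact_request(messages, detectors, api_key,
                         base_url="http://localhost:8080"):
    """Build (but do not send) a POST /api/pii/redact request."""
    body = {
        "text": "\n\n".join(messages),  # one document, messages separated by blank lines
        "detectors": detectors,
    }
    return urllib.request.Request(
        f"{base_url}/api/pii/redact",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def redact(request):
    """Send the request; a 400 `pii_blocked` surfaces as urllib.error.HTTPError."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_redact_request(
    ["What are the last four digits of your card?", "4421"],
    ["my-ner-model"],
    api_key="example-key",
)
```

Calling `redact(request)` against a running instance would return the JSON body shown in the curl example; the sketch stops short of sending so it can be read without a server.
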
Calls are audited in the same event log, tagged with an `origin` of
`pii_analyze` / `pii_redact` (the inline filter records `middleware`, the MITM
proxy records `proxy`), so `GET /api/pii/events?origin=pii_redact` shows just
the redact-API rows.

### REST surface

| Method | Path | Auth | Purpose |
|---|---|---|---|
| POST | `/api/pii/analyze` | api key (`pii_filter`) | Detect PII in a string; returns entity spans, no mutation. |
| POST | `/api/pii/redact` | api key (`pii_filter`) | Redact a string per policy; returns `redacted_text` or `400 pii_blocked`. |
| GET | `/api/pii/events` | admin | Recent middleware events: PII redactions, MITM connect/traffic, admission denials. Filterable by `correlation_id`, `user_id`, `pattern_id` (e.g. `ner:EMAIL`), `kind`, `origin`. |
| GET | `/api/middleware/status` | admin | Aggregated dashboard data: per-model PII state + detectors + router status + MITM status + admission status. One round-trip for the UI. |

### MCP tools

The same surface is mirrored through the LocalAI Assistant MCP server:

| Tool | Read/Write | Purpose |
|---|---|---|
| `get_pii_events` | read | Recent redaction / block events with optional filters. |
| `get_middleware_status` | read | Aggregator: the same payload as `GET /api/middleware/status`. |

Detection policy is part of a detector model's config, so it is managed
through the model-config tools (`edit_model_config`), not a dedicated PII
tool.

---

## Intelligent routing

A **router model** is a model whose YAML carries a `router:` block. When
a client addresses it (`"model": "smart-router"`), the middleware
classifies the prompt, picks a downstream candidate model, rewrites
`input.Model` to the candidate, and the standard model-resolution path
runs against that resolved target. ACL checks, disabled-state, and
per-model PII all apply to the resolved model; the router does
*model selection only*.

#### Depth-1 invariant

Candidates **must not** themselves be router models. A
`smart-router → claude-strict → cloud-proxy` chain is fine
(`claude-strict` is a regular cloud-proxy model). A
`smart-router → other-router → real-model` chain is rejected at runtime
by the middleware (the dispatcher returns HTTP 500 with a
`depth-1 invariant` error). This keeps the dispatch graph acyclic and
predictable.

#### Fallback

If no candidate's label set covers the active label set from the classifier,
or the classifier errors out, the router uses `cfg.Router.Fallback`.
An empty `fallback` causes the dispatch to fail with HTTP 500 rather
than silently routing somewhere unintended (fail-fast, not
silent-bypass).

### Available classifiers

LocalAI ships two classifier implementations. Pick one with `classifier:`
in the router YAML:

| Classifier | When to use | Underlying primitive |
|---|---|---|
| `score` (default) | Small classifier-tuned LM (Arch-Router-style). Best when label vocabulary is well-covered by next-token continuation. | `Score` gRPC primitive (llama-cpp, vLLM). |
| `colbert` | When label descriptions are abstract or short and a next-token classifier produces flat distributions. Robust on long-form policy descriptions. | rerankers backend in ColBERT mode (e.g. `bge-m3-colbert` from the gallery). |

Both classifiers share the same YAML shape: `classifier_model`,
`policies`, `candidates`, `fallback`, `activation_threshold`,
`classifier_cache_size`, and the optional `embedding_cache` block.

### The Score classifier

The `score` classifier works like this:

1. Build a Qwen/ChatML system prompt that lists every policy label with
   its description and primes the model to emit a label as the assistant
   turn.
2. Ask the classifier model to **score every policy label** as the
   first-token(s) continuation. This uses the `Score` gRPC primitive
   (`backend.proto::Score`), which returns per-candidate log-probabilities,
   length-normalized so that candidates of unequal token length stay
   comparable.
3. Softmax the length-normalized log-probabilities into a probability
   distribution over labels.
4. Threshold the distribution: every label whose probability passes
   `activation_threshold` joins the **active label set**.
5. Pick the FIRST candidate whose `Labels` is a superset of the active
   set. Admins order candidates smallest → largest so a single-label
   query routes to the smallest capable model, while a query that
   activates multiple labels falls to a candidate that covers them all.

This is the Arch-Router approach extended for multi-label. The
distribution carries more signal than the argmax: reading off the
spread lets one prompt activate multiple policies and route to a model
capable of all of them.

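Steps 3 to 5 can be sketched as follows, assuming the log-probabilities from step 2 are already in hand. The log-probability values are made up for illustration, `route` is a hypothetical helper, and treating "passes the threshold" as `>=` is an assumption.

```python
import math


def softmax(logps):
    """Turn per-label log-probabilities into a probability distribution."""
    peak = max(logps.values())
    exps = {label: math.exp(lp - peak) for label, lp in logps.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def route(logps, candidates, threshold, fallback=None):
    probs = softmax(logps)                                         # step 3
    active = {label for label, p in probs.items() if p >= threshold}  # step 4
    for candidate in candidates:                                   # step 5: first superset wins
        if active <= set(candidate["labels"]):
            return candidate["model"], active
    if fallback is None:
        raise RuntimeError("no candidate covers the active labels and no fallback is set")
    return fallback, active


CANDIDATES = [  # ordered smallest to largest
    {"model": "qwen3-0.6b", "labels": ["casual-chat"]},
    {"model": "qwen_qwen3.5-2b",
     "labels": ["code-generation", "casual-chat", "math-reasoning"]},
]

# Illustrative length-normalized log-probabilities, not measured values.
code_prompt = {"code-generation": -0.2, "casual-chat": -3.0, "math-reasoning": -1.0}
chat_prompt = {"code-generation": -4.0, "casual-chat": -0.1, "math-reasoning": -4.0}

code_route = route(code_prompt, CANDIDATES, threshold=0.40)
chat_route = route(chat_prompt, CANDIDATES, threshold=0.40)
```

With the lower threshold of 0.15 the code prompt also activates `math-reasoning`, which is how a compound query ends up on a candidate that covers both labels.
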
#### Recommended classifier model

[Arch-Router-1.5B](https://huggingface.co/katanemo/Arch-Router-1.5B) is
the canonical choice. It's a Qwen-2.5-1.5B-Instruct base trained
specifically on routing-policy continuation, so the ChatML
system-prompt + label-continuation pattern produces well-separated label
probabilities without prompt tuning. The Q4_K_M GGUF runs on CPU, GPU, and
Intel SYCL.

The classifier model must support the `Score` gRPC primitive (today: the
llama-cpp and vLLM backends) and use the ChatML chat template. Any small
ChatML instruct model works under those constraints, but expect flatter
probability distributions, which call for a higher
`activation_threshold` to keep noise out of the active label set.

On llama-cpp, scoring rides the server's task queue alongside
generation and embeddings, so the classifier may share a model config
with `chat`/`completion`/`embeddings`; a dedicated scorer model is no
longer required. Repeated calls with the same prompt also reuse the
prompt's KV cache across candidates.

### The Colbert classifier

The `colbert` classifier reranks each policy *description* against the
prompt via the rerankers backend and activates the labels whose
relevance scores clear `activation_threshold` (default 0.5 for
reranker-style scores in [0, 1]).

```yaml
router:
  classifier: colbert
  classifier_model: bge-m3-colbert   # gallery entry; loads BAAI/bge-m3 in ColBERT mode
  activation_threshold: 0.5
  policies:
    - label: code-generation
      description: writing, debugging, reading, or explaining code
    - label: casual-chat
      description: small talk, greetings, jokes
  candidates: [...]
```

The reranker scores the *description* (natural English) rather than
asking a small LM to score the *label* as a next-token continuation,
so it tends to be more robust when policy labels are abstract slugs
(`compliance-review`, `tier-2-support`). The trade-off is one
reranker round-trip per request; bge-m3 in ColBERT mode is fast
enough on GPU that this is comparable to the Score path for most
workloads. The `embedding_cache` block applies identically.

The reranker model's `type:` (in the model YAML) selects which
underlying scoring head loads: `colbert` for late-interaction MaxSim,
`cross-encoder` for cross-attention scoring. The classifier itself is
indifferent; pick the head that fits your latency / quality budget.

### YAML reference

```yaml
name: smart-router
known_usecases:
  - chat
router:
  # `score` (Arch-Router-style next-token scoring) or `colbert`
  # (rerank policy descriptions). See "Available classifiers" above.
  classifier: score

  # A model loaded by LocalAI that supports the Score gRPC primitive
  # (llama-cpp and vLLM ship implementations). Arch-Router-1.5B is the
  # canonical choice.
  classifier_model: arch-router-1.5b

  # Bounded LRU keyed on (case-folded, whitespace-trimmed) prompt. Prompts
  # repeat in agent loops; the cache amortises the classifier round-trip
  # across them. 0 here means "use the default" (1024); the cache cannot be
  # disabled from YAML today.
  classifier_cache_size: 256

  # Softmax probability floor a label must clear to join the active label set.
  # 0 = use the package default (0.15). 0.40 is a better empirical
  # starting point on Arch-Router-1.5B; see the tuning note below.
  activation_threshold: 0.40

  # Used when no candidate covers the active label set, or the classifier
  # itself errors. Empty here = fail-fast with HTTP 500.
  fallback: qwen3-0.6b

  # The label vocabulary. Descriptions are fed verbatim into the
  # classifier's system prompt; short, action-oriented sentences work
  # best ("writing or debugging code", "small talk").
  policies:
    - label: code-generation
      description: writing, debugging, reading, or explaining code in any programming language
    - label: casual-chat
      description: small talk, greetings, jokes, or general conversation with no specific task
    - label: math-reasoning
      description: arithmetic, equations, percentage calculations, or step-by-step word problems

  # Routing table: order matters (smallest → largest). See "Score
  # classifier" above for the matching rule.
  candidates:
    - model: qwen3-0.6b
      labels: [casual-chat]
    - model: qwen_qwen3.5-2b
      labels: [code-generation, casual-chat, math-reasoning]
```

### Tuning `activation_threshold`

The threshold is the single knob you'll want to tune per
(classifier-model, policy-set) pair. On Arch-Router-1.5B with the
three-policy setup above, sweeping the threshold over a hand-labeled
30-prompt corpus produced:

| Threshold | Label-set accuracy | End-to-end routing accuracy |
|---:|---:|---:|
| 0.15 (package default) | 30% | 73% |
| 0.30 | 57% | 87% |
| **0.40** | **60%** | **90%** |
| 0.45 | 67% | 97% |
| 0.50 | 67% | 97% |

The classifier's argmax matches the dominant label 93% of the time on
this corpus; what the threshold controls is how much secondary-label
noise leaks into the active label set. Low thresholds push single-label
queries to multi-label-capable (larger) candidates unnecessarily; 0.40
keeps the dominant label dominant without losing genuine compound
activations.

Re-tune per (classifier-model, policy-set) pair. The `/api/score`
endpoint (see below) is the convenient probe: it returns the raw
length-normalized log-probabilities so you can sweep thresholds offline
without driving real chat completions.

### Embedding cache (L2)

Classification is the most expensive thing the middleware does. The
score classifier already memo-caches verbatim repeats (case- and
whitespace-folded prompt → decision); the **embedding cache** is the
L2 tier that catches *semantically similar* prompts: "How do I exit
vim?" and "i need to quit vim" can share a decision instead of running
the classifier twice.

The cache pairs naturally with a larger / slower classifier model: the
steady-state cost on cache hits collapses to one embedding round-trip plus a
KNN search, both well under 100ms with `nomic-embed-text-v1.5` + local-store.

#### Configuration

Add an `embedding_cache:` block to a router model:

```yaml
router:
  classifier: score
  classifier_model: arch-router-1.5b
  policies: [...]
  candidates: [...]

  embedding_cache:
    embedding_model: nomic-embed-text-v1.5   # any loaded embedding model
    similarity_threshold: 0.80               # cosine sim floor for a hit (default 0.80)
    confidence_threshold: 0.60               # min top-label prob to cache a decision (default 0.60)
    # store_name: router-cache-smart-router  # optional override; defaults to "router-cache-<router>"
```

Omit the block entirely to disable. The cache adds two new failure modes
(embedder unavailable, store unavailable); both fall through to the
inner classifier so routing keeps working.

#### How it works

For each request:

1. Embed the probe prompt via the configured `embedding_model`.
2. KNN top-1 against the per-router local-store collection.
3. If similarity ≥ `similarity_threshold`, return the cached decision
   (`Cached=true`, `CacheSimilarity=<sim>` in the decision log).
4. Miss → run the inner classifier. If `decision.score >= confidence_threshold`,
   insert `(embedding, decision)` into the store. Low-confidence
   decisions are deliberately skipped so they can't poison future
   paraphrases.

The local-store collection is named `router-cache-<router-model-name>` by
default; each router gets its own collection so two routers can't
cross-contaminate. Collections persist on disk (local-store is the
canonical persistent vector backend), so the cache survives restarts.

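The four steps can be sketched with an in-memory list standing in for the local-store collection. Everything here is illustrative: `EmbeddingCache`, the toy two-dimensional vectors and the toy classifier are assumptions for the example, and a real deployment would call the configured embedding model and the vector store instead.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EmbeddingCache:
    """In-memory stand-in for the per-router local-store collection."""

    def __init__(self, embed, classify,
                 similarity_threshold=0.80, confidence_threshold=0.60):
        self.embed = embed
        self.classify = classify
        self.similarity_threshold = similarity_threshold
        self.confidence_threshold = confidence_threshold
        self.store = []  # (embedding, decision) pairs

    def decide(self, prompt):
        vec = self.embed(prompt)                                   # step 1
        scored = [(cosine(vec, emb), dec) for emb, dec in self.store]
        if scored:
            sim, cached = max(scored, key=lambda pair: pair[0])    # step 2: top-1
            if sim >= self.similarity_threshold:                   # step 3: hit
                return {**cached, "cached": True, "cache_similarity": sim}
        decision = self.classify(prompt)                           # step 4: miss
        if decision["score"] >= self.confidence_threshold:
            self.store.append((vec, decision))                     # confident enough to keep
        return {**decision, "cached": False}


# Toy embeddings and classifier, purely for illustration.
VECTORS = {
    "How do I exit vim?": [1.0, 0.0],
    "i need to quit vim": [0.9, 0.1],
    "tell me a joke": [0.0, 1.0],
}
calls = []


def toy_classify(prompt):
    calls.append(prompt)
    if "vim" in prompt:
        return {"label": "code-generation", "score": 0.9}
    return {"label": "casual-chat", "score": 0.4}


cache = EmbeddingCache(VECTORS.__getitem__, toy_classify)
first = cache.decide("How do I exit vim?")    # miss, stored for later
second = cache.decide("i need to quit vim")   # hit on the paraphrase
third = cache.decide("tell me a joke")        # miss, too unsure to store
```

The classifier runs twice for three requests, and the low-confidence decision never enters the store.
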
#### Tuning notes

- **Similarity threshold**: 0.80 is the package default; re-tune
  per (embedding model, corpus). The histogram on the Routing tab
  shows where the cosine distribution actually sits; pick a
  threshold above the cross-intent cluster and below the paraphrase
  cluster.
- **Confidence threshold**: 0.60 corresponds roughly to "the
  classifier is committed to a top label." Don't lower this; caching
  unsure decisions propagates the uncertainty.
- **Cache flush**: the classifier cache invalidates automatically when
  the router YAML changes (it is fingerprinted by `yaml.Marshal`),
  but the underlying local-store collection still holds the old
  payloads. For a hard reset, flush manually via local-store admin or
  rename `store_name`.
- **Latency budget**: an embedding round-trip (typically 30–80ms for
  small embedding models) plus KNN search (~5ms) is added to every
  *miss* on top of the classifier latency. Cache hits skip the
  classifier entirely. Break-even is around 7–10% hit rate; agent
  loops with repeated phrasing easily exceed this.

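The break-even figure can be reproduced with a simple cost model. This is a back-of-the-envelope sketch that assumes every request pays the embedding and KNN overhead and only misses pay the classifier; `break_even_hit_rate` is a hypothetical helper, and the 500 ms classifier latency is the Arch-Router figure quoted under Operational notes.

```python
def break_even_hit_rate(classifier_ms, embed_ms, knn_ms):
    """Hit rate at which cached routing costs the same as always classifying.

    With hit rate h, the average cost with the cache is
        overhead + (1 - h) * classifier
    which equals the uncached cost (classifier) when
        h = overhead / classifier
    """
    overhead_ms = embed_ms + knn_ms
    return overhead_ms / classifier_ms


fast_embedder = break_even_hit_rate(classifier_ms=500, embed_ms=30, knn_ms=5)
mid_embedder = break_even_hit_rate(classifier_ms=500, embed_ms=45, knn_ms=5)
slow_embedder = break_even_hit_rate(classifier_ms=500, embed_ms=80, knn_ms=5)
```

Under this model a 30 to 45 ms embedding call gives the 7 to 10% range, while an 80 ms call pushes break-even to about 17%, so a slower embedder needs a correspondingly higher hit rate to pay off.
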
### Admin page

The `/app/middleware` page has a **Routing** tab listing every router
model's classifier, policies, candidates, and fallback. The **Events**
tab shows the decision log: one row per classified request with
correlation ID, requested model, served model, classifier name, active
labels, top-label score, and latency.

Routing decisions are stored in an in-process ring buffer (default
capacity 5,000). The decision log is for audit and tuning; the
canonical usage log lives in `/api/usage` and correlates by request ID.

### REST surface

| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | `/api/router/status` | any | Router configuration: each router model's classifier, policies, candidates. |
| GET | `/api/router/decisions` | admin | Decision log with optional filters (`correlation_id`, `user_id`, `router_model`, `limit`). |
| POST | `/api/score` | admin | Direct access to the `Score` gRPC primitive; useful for offline threshold tuning. Body: `{"model": "<classifier-model>", "prompt": "<chatml-prompt>", "candidates": ["label-a", ...], "length_normalize": true}`. The llama-cpp and vLLM backends implement Score; other backends return `UNIMPLEMENTED`. |

### MCP tools

| Tool | Read/Write | Purpose |
|---|---|---|
| `get_router_decisions` | read | Recent decision log with optional filters. |
| `get_middleware_status` | read | Includes the router section listing configured router models. |

Mutating routing config (adding a candidate, changing the classifier
model) is YAML-only today; reload with `POST /models/reload` to pick
up edits without restarting.

### Operational notes

- **Reload after YAML edits.** The router configs are loaded at startup
  and cached. `POST /models/reload` re-reads from disk; the next request
  rebuilds the classifier from the new config (the classifier cache is
  fingerprinted by `yaml.Marshal(RouterConfig)` so it invalidates
  automatically).
- **Classifier latency** on Arch-Router-1.5B Q4_K_M is ~500ms steady
  for 3 policies on Intel SYCL. The score primitive re-decodes the full
  prompt for every candidate today (the KV cache is cleared between
  candidates); the prompt-KV-sharing optimization is on the perf TODO
  list in `backend/cpp/llama-cpp/grpc-server.cpp::Score`. Until then,
  `classifier_cache_size` is the highest-leverage knob for repeat-query
  workloads (agent loops).
- **Decision log size**: 5,000-entry ring buffer per process. The
  log is in-process and not persisted; pair with the usage log for
  long-horizon audit.

---

## Related features

- [Cloud passthrough proxy]({{< relref "cloud-proxy.md" >}}): combine
  the router with `proxy-*` backends to send simple prompts to local
  models and complex ones to cloud providers.
- [MITM proxy]({{< relref "mitm-proxy.md" >}}): apply the same PII
  filter to Claude Code, Codex CLI, and any HTTPS client without
  LocalAI holding their API keys.
- [Authentication]({{< relref "authentication.md" >}}): admin role is
  required for mutating endpoints and the `/app/middleware` page; in
  no-auth single-user mode the synthetic local user has admin role
  automatically.