LocalAI/docs/content/features/middleware.md
Richard Palethorpe 3fa7b2955c feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360)
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:

- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
  PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
  TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
  whole conversations as one document), with a recreated bounded restricted-
  regex secret-matching pattern detector tier alongside it (per-model
  pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
  debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
  filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
  editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
  filter; token_classify known_usecase; batch sized to context for NER models.
  privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
  meta + image entries with a capabilities map) matching its CI matrix jobs,
  and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
  privacy-filter GGUF detection) replacing the prior pref-only registration.

Reconciled against master's independent evolution:

- Dropped master's PIIPatternOverrides feature (global-pattern runtime
  overrides + /api/pii/patterns API + runtime_settings.json persistence). The
  per-model NER + pattern-detector design supersedes it; it was built on the
  global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
  removed the patch and restored master's grpc-server.cpp Score RPC (direct
  llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
  model_config validation forbidding score + chat/completion/embeddings on
  llama-cpp. token_classify is unaffected (it runs on the privacy-filter
  backend, not llama-cpp).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-18 11:45:22 +01:00

+++
title = "Middleware: PII filtering and intelligent routing"
weight = 27
toc = true
description = "Per-model PII redaction and policy-based request routing"
tags = ["Routing", "Privacy", "PII", "Middleware", "Advanced"]
categories = ["Features"]
+++
![The request lifecycle: one shared hook chain for auth, model routing, and PII, with decision and event logs](/images/diagrams/middleware-lifecycle.png)
LocalAI ships a request-middleware layer that sits between the HTTP API and
the backend dispatcher. Two subsystems share that layer because they share
the same lifecycle hook: **PII filtering** scans the request body before it
reaches a backend, and the **intelligent router** rewrites `input.Model` so
a single client-facing model name fans out across multiple downstream
targets.
Both are inspected and configured from the same admin page
(`/app/middleware`), backed by the same REST surface (`/api/middleware/*`,
`/api/pii/*`, `/api/router/*`) and the same MCP tools.
## Request lifecycle
```
client ── auth ── route-model ── per-model PII ── backend ── client
                       │               │
                       │               └─── event log
                       └─── decision log
```
The router runs first (it picks the target model so per-model PII has
something to gate on), per-model PII runs next (gated by the resolved
config), and the backend executes. Filtering is **request-side only**:
the request body is scanned and rewritten before forwarding; the response
is not touched (NER over a streamed response is left as a follow-up). Each
subsystem writes to its own admin-visible log: `/api/router/decisions` for
routing, `/api/pii/events` for redaction and block actions.
---
## PII filtering
PII redaction is **NER-based and runs request-side (input)**. It is
**off by default**, flipping to **on for any `cloud-proxy` backend**
because that traffic crosses the network to a third-party provider. Pick a
[default detector](#instance-wide-default-detector) so those models are actually
scanned. Explicit `pii.enabled` in a model's YAML always wins over the
backend default.
Filtering runs on every text-accepting endpoint that has an adapter wired:
`/v1/chat/completions` and `/v1/messages` (chat), `/v1/completions`,
`/v1/embeddings`, `/v1/edits`, and the Ollama `/api/chat`, `/api/generate`
and `/api/embed` endpoints, plus the [MITM proxy]({{< relref "mitm-proxy.md" >}})
request body. Image, audio (TTS/STT), video, rerank, and the realtime
WebSocket are not filtered yet (different prompt-PII semantics; realtime is
not HTTP middleware).
A request's messages are scanned **as one document** (joined in order), so
the NER detector keeps conversational context: whether `4421` is a PIN or
`jdoe_42` is a username is usually decided by the question asked in the
*previous* message, and a bidirectional encoder only sees that context when
the messages share a forward pass. Detected spans are mapped back to the
individual message they fall in, so redaction still rewrites each message
field in place and events carry message-local offsets.
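
The span mapping described above can be sketched as follows. This is a minimal illustration, not LocalAI's implementation: the function names are hypothetical, and the blank-line separator is an assumption about how messages are joined.

```python
from typing import List, Optional, Tuple

SEPARATOR = "\n\n"  # assumption: the joiner used by the real filter may differ


def join_messages(messages: List[str]) -> Tuple[str, List[int]]:
    """Join messages into one document and record where each one starts."""
    starts, pos = [], 0
    for text in messages:
        starts.append(pos)
        pos += len(text) + len(SEPARATOR)
    return SEPARATOR.join(messages), starts


def to_message_local(
    span: Tuple[int, int], messages: List[str], starts: List[int]
) -> Optional[Tuple[int, int, int]]:
    """Map a document-level (start, end) span to (message_index, start, end)."""
    start, end = span
    for i, base in enumerate(starts):
        if base <= start and end <= base + len(messages[i]):
            return i, start - base, end - base
    return None  # the span touches a separator or straddles two messages


messages = ["What is your PIN?", "It is 4421."]
document, starts = join_messages(messages)
hit = document.index("4421")
print(to_message_local((hit, hit + 4), messages, starts))  # (1, 6, 10)
```

The detector sees the question and the answer in one pass, while the returned offsets point into the second message only, which is what in-place rewriting needs.
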
> The earlier global regex pattern tier (`pii.patterns`, the global built-in
> pattern catalogue, `--pii-config`, the `/api/pii/patterns|test|decide`
> endpoints) and response/streaming-side redaction have been **removed**.
> Detection is now driven by per-model detectors: token-classification (NER)
> models, plus the restricted [pattern detector tier](#pattern-detector-tier)
> for secrets. Legacy keys no-op with a startup warning.
### Detector models
A **detector** is a `token_classify` model (e.g. an `openai-privacy-filter`
GGUF) that carries the detection *policy* in a top-level `pii_detection:`
block — defined once, on the model itself:
```yaml
name: privacy-filter-multilingual
backend: privacy-filter
embeddings: true                 # TOKEN_CLS pooling
known_usecases:
  - token_classify
pii_detection:
  min_score: 0.5                 # drop detections below this confidence
  default_action: mask           # applied to any detected group with no entry
  entity_actions:                # which PII to block vs mask vs allow-log
    PASSWORD: block
    CREDITCARD: block
    EMAIL: mask
```
`mask` rewrites the matched span to `[REDACTED:ner:<GROUP>]` in the request
body before forwarding. `block` returns HTTP 400 (`error.type=pii_blocked`)
without forwarding. `allow` detects and logs (a PIIEvent is still recorded)
but leaves the text unchanged. The entity-group names are whatever the model
emits (the privacy-filter family uses uppercase names like `EMAIL`,
`PASSWORD`, `CREDITCARD`).
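
As a rough sketch of how the three actions play out, the snippet below applies a `pii_detection` policy to a list of detections. The function names and the detection tuple shape are hypothetical, the fallback to `mask` when `default_action` is unset is an assumption, and detections are assumed not to overlap.

```python
from typing import Dict, List, Optional, Tuple

Detection = Tuple[int, int, str, float]  # (start, end, group, score); hypothetical shape


def resolve_action(group: str, policy: Dict) -> str:
    """entity_actions wins; anything unlisted gets default_action."""
    # Assumption: "mask" is used when default_action is not set at all.
    return policy.get("entity_actions", {}).get(group, policy.get("default_action", "mask"))


def apply_policy(
    text: str, detections: List[Detection], policy: Dict
) -> Tuple[Optional[str], bool]:
    """Return (rewritten_text, blocked); rewritten_text is None when blocked."""
    kept = [d for d in detections if d[3] >= policy.get("min_score", 0.0)]
    if any(resolve_action(d[2], policy) == "block" for d in kept):
        return None, True  # the caller would answer HTTP 400 pii_blocked
    # Rewrite right to left so earlier offsets stay valid.
    for start, end, group, _score in sorted(kept, reverse=True):
        if resolve_action(group, policy) == "mask":
            text = text[:start] + f"[REDACTED:ner:{group}]" + text[end:]
    return text, False  # "allow" detections are logged elsewhere, text untouched


policy = {
    "min_score": 0.5,
    "default_action": "mask",
    "entity_actions": {"PASSWORD": "block", "EMAIL": "mask"},
}
print(apply_policy("mail jane@acme.io now", [(5, 17, "EMAIL", 0.97)], policy))
```
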
### Pattern detector tier
NER is the wrong tool for high-entropy, highly-regular **secrets** — API keys,
tokens, private-key blocks. A trained NER model has no "API key" class, so it
fragments a key into the nearest categories it *does* know and can leave the
secret part exposed. Those secrets are exactly what a regex catches cheaply.
A **pattern detector** is a detector model (`backend: pattern`) that matches
secrets with a **restricted regex subset** compiled to Go's RE2 engine —
linear-time, no backtracking, no ReDoS. It runs entirely in-process: no model
download, no backend, zero VRAM. Install the gallery's **`secret-filter`** for a
ready-made set, or define your own:
```yaml
name: secret-filter
backend: pattern
known_usecases: [token_classify]   # so it appears in the detector picker
pii_detection:
  default_action: block            # a leaked credential shouldn't leave
  builtins:                        # built-in catalogue (enable by name)
    - anthropic_api_key
    - openai_api_key
    - github_token
    - aws_access_key
    - private_key_block
  patterns:                        # operator-defined, restricted subset
    - name: INTERNAL_TOKEN
      match: "tok-[A-Za-z0-9]{32,64}"
      action: block                # optional per-pattern override
      min_len: 36                  # optional length floor
```
A match is reported under its group (built-in group name, or the pattern
`name`), so `entity_actions` / `default_action` apply exactly as for NER.
**The restricted grammar** (validated at load — an invalid pattern is rejected,
not silently ignored):
- Allowed: literals, character classes `[…]` and `\w \d \s`, alternation,
anchors `^ $ \b`, and quantifiers `? * + {m,n}`.
- Rejected: `.` (any-char), capturing groups, and `{n,m}` bounds over 4096.
- **Required anchor**: every pattern must contain a fixed literal run of at
least 3 characters (e.g. `sk-ant-`, `ghp_`, `AKIA`). This admits real key
shapes but rejects open-ended ones — an email or a bare `\w+` has no such
anchor and belongs to the [NER tier](#detector-models).
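
To make the grammar rules concrete, here is a deliberately simplified checker. It is a sketch under stated assumptions, not the validator LocalAI ships: `validate_pattern` is a hypothetical name, non-capturing `(?:...)` groups are assumed to be allowed, character classes are scanned naively up to the first `]`, and any quantified atom is conservatively treated as non-literal.

```python
MAX_BOUND = 4096   # upper limit for {m,n} bounds
MIN_ANCHOR = 3     # required length of the fixed literal run


def validate_pattern(pattern: str) -> int:
    """Return the longest fixed literal run; raise ValueError on rejection."""
    atoms = []  # True = fixed literal character, False = anything else
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == ".":
            raise ValueError("'.' (any-char) is not allowed")
        if ch == "(":
            if not pattern.startswith("(?:", i):
                raise ValueError("capturing groups are not allowed")
            atoms.append(False)
            i += 3
        elif ch == "\\":
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash")
            atoms.append(pattern[i + 1] not in "wdsb")  # \w \d \s \b are not literals
            i += 2
        elif ch == "[":
            close = pattern.find("]", i + 1)
            if close == -1:
                raise ValueError("unterminated character class")
            atoms.append(False)
            i = close + 1
        elif ch in "?*+{":
            if not atoms:
                raise ValueError("quantifier with nothing to repeat")
            if ch == "{":
                close = pattern.find("}", i)
                if close == -1:
                    raise ValueError("unterminated {m,n}")
                bounds = [int(b) for b in pattern[i + 1:close].split(",") if b.strip()]
                if any(b > MAX_BOUND for b in bounds):
                    raise ValueError("repeat bound over 4096")
                i = close + 1
            else:
                i += 1
            if ch == "+":
                atoms.append(False)  # one copy is fixed; the run ends after it
            else:
                atoms[-1] = False    # the quantified atom may be absent or repeated
        else:
            atoms.append(ch not in ")|^$")  # group ends, alternation, anchors break a run
            i += 1
    best = run = 0
    for fixed in atoms:
        run = run + 1 if fixed else 0
        best = max(best, run)
    if best < MIN_ANCHOR:
        raise ValueError("pattern needs a fixed literal run of at least 3 characters")
    return best


print(validate_pattern("tok-[A-Za-z0-9]{32,64}"))  # 4, from the literal "tok-"
```

A pattern such as `\w+` or `ab?cd` fails the anchor check here, while `sk-ant-[A-Za-z0-9]+` passes on its seven-character prefix.
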
Use both tiers together: reference an NER detector *and* a pattern detector in a
model's `pii.detectors` (or as instance defaults); their hits union, and a
`block` from either rejects the request.
### Consuming models
Any model opts in by enabling PII and referencing one or more detectors —
no per-consumer policy:
```yaml
name: claude-strict
backend: cloud-proxy
proxy:
  mode: passthrough
  provider: anthropic
  upstream_url: https://api.anthropic.com/v1/messages
  api_key_env: ANTHROPIC_API_KEY
pii:
  enabled: true                  # default-on for cloud-proxy; explicit for audit
  detectors:
    - privacy-filter-multilingual
```
Multiple detectors **union** their detections; overlapping spans resolve to
the strongest action (`block` > `mask` > `allow`). A configured detector
that can't be loaded **fails the request closed** (HTTP 503,
`error.type=pii_ner_unavailable`) rather than silently skipping the check.
The same NER path runs on the [MITM proxy]({{< relref "mitm-proxy.md" >}})
request body for intercepted hosts. Response/output redaction is out of
scope for now.
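
The union rule can be pictured with a small merge over spans from several detectors. This is an illustrative sketch only: `merge_overlaps` is a hypothetical name, and widening an overlapping pair to the union of both spans is an assumption about how overlaps are represented.

```python
from typing import List, Tuple

STRENGTH = {"allow": 0, "mask": 1, "block": 2}
Span = Tuple[int, int, str]  # (start, end, action)


def merge_overlaps(spans: List[Span]) -> List[Span]:
    """Union spans from all detectors; overlapping spans keep the strongest action."""
    merged: List[Span] = []
    for start, end, action in sorted(spans):
        if merged and start < merged[-1][1]:  # overlaps the previous span
            p_start, p_end, p_action = merged[-1]
            strongest = max(action, p_action, key=STRENGTH.__getitem__)
            merged[-1] = (p_start, max(end, p_end), strongest)
        else:
            merged.append((start, end, action))
    return merged


ner_hits = [(0, 10, "mask")]
pattern_hits = [(5, 20, "block"), (30, 35, "allow")]
print(merge_overlaps(ner_hits + pattern_hits))
# [(0, 20, 'block'), (30, 35, 'allow')]
```
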
### Instance-wide default detector
The **Detector models** table on the Middleware → Filtering page lists every
`token_classify` detector model (neural NER models and in-process pattern
matchers alike) and exposes a per-row **Default** toggle. Toggling a detector
on adds it to the instance-wide default detector set — one or more models
applied to any PII-enabled model that names none of its own `pii.detectors`.
It is persisted through `POST /api/settings` and read live, so a change takes
effect on the next request without a restart. A default that names a model no
longer loaded still appears (marked *not loaded*) so it can be toggled off.
This is what makes `cloud-proxy` / MITM redaction work out of the box: those
backends default to PII-enabled but ship no detector list, so without a
default detector the filter runs with nothing to scan. Set one here and
cloud-proxy traffic is scanned with no per-model config.
Resolution precedence (the single decision point is `ResolvePIIPolicy`,
shared by the chat middleware and the MITM listener so both agree):
1. An explicit `pii.enabled` on the model wins — `true` or `false`.
2. Otherwise PII is on if the backend defaults it on (`cloud-proxy`).
3. Detectors are the model's own `pii.detectors`; if it lists none, the
instance-wide default detector(s) are used.
A model that resolves enabled but ends up with no detector at all (a
cloud-proxy model with no model detectors and no instance default) scans
nothing — set a default detector to close that gap.
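
The precedence above is compact enough to write down directly. The sketch below mirrors the three rules; it is not the real `ResolvePIIPolicy`, and the Python signature is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

DEFAULT_ON_BACKENDS = {"cloud-proxy"}


def resolve_pii_policy(
    backend: str,
    enabled: Optional[bool],            # None = pii.enabled absent from the YAML
    model_detectors: Sequence[str],
    default_detectors: Sequence[str],
) -> Tuple[bool, List[str]]:
    """Return (enabled, detectors) following the documented precedence."""
    if enabled is None:                  # rule 2: fall back to the backend default
        enabled = backend in DEFAULT_ON_BACKENDS
    if not enabled:                      # rule 1: an explicit false always wins
        return False, []
    # rule 3: the model's own detectors, else the instance-wide defaults
    return True, list(model_detectors) or list(default_detectors)


print(resolve_pii_policy("cloud-proxy", None, [], ["privacy-filter-multilingual"]))
# (True, ['privacy-filter-multilingual'])
print(resolve_pii_policy("cloud-proxy", None, [], []))
# (True, []) -> enabled, but nothing to scan with
```
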
### Admin page
The `/app/middleware` page (admin role only) has four tabs: **Filtering**,
**Routing**, **MITM Proxy** (see the [MITM doc]({{< relref "mitm-proxy.md" >}})),
and **Events**. The Filtering tab has two tables:

- A **Detector models** table: every `token_classify` filter model, with the
  per-row Default toggle described above, an edit link to each detector's
  config, and an *Add detector model* button.
- A per-model table listing only the models PII can actually apply to: chat /
  completion / embeddings / edit consumers and cloud-proxy models, not
  VAD/STT/image models or the detector models themselves.

Each row of the per-model table reports:

- The **effective** `enabled` state as an inline **toggle**. Flipping it writes
  an explicit `pii.enabled` to that model's YAML (a server-side deep-merge that
  preserves `pii.detectors` and every other field), so a cloud-proxy model shown
  on by backend default can be turned off, and vice versa.
- The resolved detector(s), with a *(default)* marker when they come from the
  instance-wide default rather than the model's YAML.
- Why it is on (`YAML` / `backend default`).
- The recent event count.

Detection *policy* (entity→action, min score) is still edited on each detector
model's config (Models → edit → PII), not globally.
### Analyze / redact API
The same detection pipeline is also exposed as a standalone service, so a
client can scan or sanitise a string **without** routing a full chat request
through it (the inline path above). Two endpoints, both requiring a normal API
key (the `pii_filter` feature — not admin):
- `POST /api/pii/analyze` — detect only. Returns the matched entity spans
(`entity_type`, `source` `ner`|`pattern`, `start`/`end`, `score`, `action`)
and a `blocked` flag, **without modifying the text**.
- `POST /api/pii/redact` — apply the configured policy. Returns `redacted_text`
(with masked spans replaced by `[REDACTED:<id>]`) and `masked`; when a `block`
action fires it returns `400` with `type: pii_blocked` and the offending
entities — never a redacted body.
Both take the same request: `text` plus a detector selection — either explicit
detector model names in `detectors`, or a consuming `model` whose **effective**
policy is used: the model's own `pii.detectors`, else the
[instance-wide default detectors](#instance-wide-default-detector), exactly as
the inline filter resolves them. A `model` with PII disabled — or enabled but
with no detector anywhere — is a `400`: the inline filter would scan nothing
for it, and the API says so rather than implying a clean scan. The detection
policy lives on the detector models exactly as for the inline filter. The raw
matched value is never returned (an admin may pass `reveal: true` to include
the audit `hash_prefix`).
`text` is scanned as a single document. To reproduce the inline filter's
conversation-context behaviour for multi-message content, join the messages
with blank lines into one `text` — NER detection quality depends on that
context (a bare `4421` is nothing; after "what are the last four digits of
your card?" it is a PIN).
```bash
# Redact with an explicit pattern/NER detector.
# The Authorization header is double-quoted so the shell expands $API_KEY.
curl -sX POST http://localhost:8080/api/pii/redact \
  -H "Authorization: Bearer $API_KEY" -H 'Content-Type: application/json' \
  -d '{"text":"reach me at jane@acme.io","detectors":["my-ner-model"]}'
# => {"redacted_text":"reach me at [REDACTED:ner:EMAIL]","masked":true,...}

# Analyze using a consuming model's configured detectors
curl -sX POST http://localhost:8080/api/pii/analyze \
  -H "Authorization: Bearer $API_KEY" -H 'Content-Type: application/json' \
  -d '{"text":"sk-ant-api03-…","model":"gpt-4"}'
# => {"entities":[{"entity_type":"ANTHROPIC_KEY","source":"pattern",...,"action":"block"}],"blocked":true}
```
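
The same calls can be made from a script. The sketch below only builds the request with the standard library; `build_pii_request` is a hypothetical helper, and rejecting a call that passes both `detectors` and `model` (or neither) is a client-side choice made here, not documented server behaviour.

```python
import json
import urllib.request


def build_pii_request(base_url, endpoint, api_key, text, detectors=None, model=None):
    """Build a POST for /api/pii/analyze or /api/pii/redact."""
    if endpoint not in ("analyze", "redact"):
        raise ValueError("endpoint must be 'analyze' or 'redact'")
    if bool(detectors) == bool(model):
        raise ValueError("pass either detectors or model")
    payload = {"text": text}
    if detectors:
        payload["detectors"] = list(detectors)
    else:
        payload["model"] = model
    return urllib.request.Request(
        f"{base_url}/api/pii/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


req = build_pii_request(
    "http://localhost:8080", "redact", "my-key",
    "reach me at jane@acme.io", detectors=["my-ner-model"],
)
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` raises `urllib.error.HTTPError` on a `400`, so a `pii_blocked` response surfaces as an exception whose body carries the offending entities.
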
Calls are audited in the same event log, tagged with an `origin` of
`pii_analyze` / `pii_redact` (the inline filter records `middleware`, the MITM
proxy records `proxy`), so `GET /api/pii/events?origin=pii_redact` shows just
the redact-API rows.
### REST surface
| Method | Path | Auth | Purpose |
|---|---|---|---|
| POST | `/api/pii/analyze` | api key (`pii_filter`) | Detect PII in a string; returns entity spans, no mutation. |
| POST | `/api/pii/redact` | api key (`pii_filter`) | Redact a string per policy; returns `redacted_text` or `400 pii_blocked`. |
| GET | `/api/pii/events` | admin | Recent middleware events — PII redactions, MITM connect/traffic, admission denials. Filterable by `correlation_id`, `user_id`, `pattern_id` (e.g. `ner:EMAIL`), `kind`, `origin`. |
| GET | `/api/middleware/status` | admin | Aggregated dashboard data: per-model PII state + detectors + router status + MITM status + admission status. One round-trip for the UI. |
### MCP tools
The same surface is mirrored through the LocalAI Assistant MCP server:
| Tool | Read/Write | Purpose |
|---|---|---|
| `get_pii_events` | read | Recent redaction / block events with optional filters. |
| `get_middleware_status` | read | Aggregator — the same payload as `GET /api/middleware/status`. |
Detection policy is part of a detector model's config, so it is managed
through the model-config tools (`edit_model_config`), not a dedicated PII
tool.
---
## Intelligent routing
A **router model** is a model whose YAML carries a `router:` block. When
a client addresses it (`"model": "smart-router"`), the middleware
classifies the prompt, picks a downstream candidate model, rewrites
`input.Model` to the candidate, and the standard model-resolution path
runs against that resolved target. ACL checks, disabled-state, and
per-model PII all apply to the resolved model — the router does
*model selection only*.
### Depth-1 invariant
Candidates **must not** themselves be router models. A
`smart-router → claude-strict → cloud-proxy` chain is fine
(`claude-strict` is a regular cloud-proxy model). A
`smart-router → other-router → real-model` chain is rejected at runtime
by the middleware (the dispatcher returns HTTP 500 with a
`depth-1 invariant` error). This keeps the dispatch graph acyclic and
predictable.
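
The invariant amounts to a one-level lookup. The check below is a sketch over parsed model configs held in a plain dictionary; `check_depth_one` and the `configs` mapping are hypothetical, and the real check runs inside the middleware at request time.

```python
def check_depth_one(router_name, configs):
    """configs maps model name -> parsed YAML dict; raise if a candidate is a router."""
    for candidate in configs[router_name]["router"]["candidates"]:
        target = configs.get(candidate["model"], {})
        if "router" in target:
            raise ValueError(
                f"depth-1 invariant: candidate {candidate['model']!r} is itself a router"
            )


configs = {
    "smart-router": {"router": {"candidates": [{"model": "claude-strict"}]}},
    "claude-strict": {"backend": "cloud-proxy"},
}
check_depth_one("smart-router", configs)  # passes: claude-strict is a regular model
```
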
### Fallback
If no candidate's label set covers the active label set from the classifier,
or the classifier errors out, the router uses `cfg.Router.Fallback`.
An empty `fallback` causes the dispatch to fail with HTTP 500 rather
than silently routing somewhere unintended — fail-fast, not
silent-bypass.
### Available classifiers
LocalAI ships two classifier implementations. Pick one with `classifier:`
in the router YAML:
| Classifier | When to use | Underlying primitive |
|---|---|---|
| `score` (default) | Small classifier-tuned LM (Arch-Router-style). Best when label vocabulary is well-covered by next-token continuation. | `Score` gRPC primitive (llama-cpp, vLLM). |
| `colbert` | When label descriptions are abstract or short and a next-token classifier produces flat distributions. Robust on long-form policy descriptions. | rerankers backend in ColBERT mode (e.g. `bge-m3-colbert` from the gallery). |
Both classifiers share the same YAML shape: `classifier_model`,
`policies`, `candidates`, `fallback`, `activation_threshold`,
`classifier_cache_size`, and the optional `embedding_cache` block.
### The Score classifier
The `score` classifier works like this:
1. Build a Qwen/ChatML system prompt that lists every policy label with
its description and primes the model to emit a label as the assistant
turn.
2. Ask the classifier model to **score every policy label** as the
first-token(s) continuation. This uses the `Score` gRPC primitive
(`backend.proto::Score`), which returns per-candidate log-probabilities
length-normalized so candidates of unequal token length stay
comparable.
3. Softmax the length-normalized log-probabilities into a probability
distribution over labels.
4. Threshold the distribution: every label whose probability passes
`activation_threshold` joins the **active label set**.
5. Pick the FIRST candidate whose `Labels` is a superset of the active
set. Admins order candidates smallest → largest so a single-label
query routes to the smallest capable model, while a query that
activates multiple labels falls to a candidate that covers them all.
This is the Arch-Router approach extended for multi-label. The
distribution carries more signal than the argmax — reading off the
spread lets one prompt activate multiple policies and route to a model
capable of all of them.
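
Steps 3 to 5 can be sketched in a few lines. The function names and the log-probability values are made up for illustration, and two details are assumptions: a label "passes" the threshold when its probability is greater than or equal to it, and an empty active set is covered by the first candidate.

```python
import math


def softmax(logprobs):
    """Turn length-normalized log-probabilities into a distribution over labels."""
    top = max(logprobs.values())
    weights = {label: math.exp(lp - top) for label, lp in logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def active_labels(logprobs, threshold):
    return {label for label, p in softmax(logprobs).items() if p >= threshold}


def pick_candidate(active, candidates, fallback=None):
    """First candidate whose labels are a superset of the active set, else fallback."""
    for candidate in candidates:
        if active <= set(candidate["labels"]):
            return candidate["model"]
    if fallback:
        return fallback
    raise LookupError("no candidate covers the active labels and no fallback is set")


candidates = [  # ordered smallest -> largest
    {"model": "qwen3-0.6b", "labels": ["casual-chat"]},
    {"model": "qwen_qwen3.5-2b", "labels": ["code-generation", "casual-chat", "math-reasoning"]},
]
scores = {"code-generation": -0.2, "casual-chat": -2.5, "math-reasoning": -0.4}  # illustrative
active = active_labels(scores, threshold=0.40)
print(sorted(active), "->", pick_candidate(active, candidates))
# ['code-generation', 'math-reasoning'] -> qwen_qwen3.5-2b
```

Here two labels clear the threshold, so the single-label candidate is skipped and the request falls to the model that covers both.
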
#### Recommended classifier model
[Arch-Router-1.5B](https://huggingface.co/katanemo/Arch-Router-1.5B) is
the canonical choice. It's a Qwen-2.5-1.5B-Instruct base trained
specifically on routing-policy continuation, so the ChatML system-prompt
+ label-continuation pattern produces well-separated label probabilities
without prompt tuning. The Q4_K_M GGUF runs on CPU, GPU, and Intel SYCL.
The classifier model must support the `Score` gRPC primitive (today: the
llama-cpp and vLLM backends) and use the ChatML chat template. Any small
ChatML instruct model works under those constraints, but expect flatter
probability distributions which translate to a higher
`activation_threshold` to keep noise out of the active label set.
On llama-cpp, the `Score` RPC calls `llama_decode` directly and bypasses the
server's slot loop, so model-config validation forbids combining `score` with
`chat`/`completion`/`embeddings` on the same llama-cpp model. Give the
classifier its own model config. The prompt is re-decoded for every candidate
today; see [Operational notes](#operational-notes).
### The Colbert classifier
The `colbert` classifier reranks each policy *description* against the
prompt via the rerankers backend and activates the labels whose
relevance scores clear `activation_threshold` (default 0.5 for
reranker-style scores in [0, 1]).
```yaml
router:
  classifier: colbert
  classifier_model: bge-m3-colbert   # gallery entry; loads BAAI/bge-m3 in ColBERT mode
  activation_threshold: 0.5
  policies:
    - label: code-generation
      description: writing, debugging, reading, or explaining code
    - label: casual-chat
      description: small talk, greetings, jokes
  candidates: [...]
```
The reranker scores the *description* (natural English) rather than
asking a small LM to score the *label* as a next-token continuation,
so it tends to be more robust when policy labels are abstract slugs
(`compliance-review`, `tier-2-support`). The trade-off is one
reranker round-trip per request — bge-m3 in ColBERT mode is fast
enough on GPU that this is comparable to the Score path for most
workloads. The `embedding_cache` block applies identically.
The reranker model's `type:` (in the model YAML) selects which
underlying scoring head loads — `colbert` for late-interaction MaxSim,
`cross-encoder` for cross-attention scoring. The classifier itself is
indifferent; pick the head that fits your latency / quality budget.
### YAML reference
```yaml
name: smart-router
known_usecases:
  - chat
router:
  # `score` (Arch-Router-style next-token scoring) or `colbert`
  # (rerank policy descriptions). See "Available classifiers" above.
  classifier: score

  # A model loaded by LocalAI that supports the Score gRPC primitive
  # (llama-cpp and vLLM ship implementations). Arch-Router-1.5B is the
  # canonical choice.
  classifier_model: arch-router-1.5b

  # Bounded LRU keyed on (case-folded, whitespace-trimmed) prompt. Prompts
  # repeat in agent loops; the cache amortises the classifier round-trip
  # across them. 0 here means "use the default" (1024); the cache cannot be
  # disabled from YAML today.
  classifier_cache_size: 256

  # Softmax probability floor a label must clear to join the active label set.
  # 0 = use the package default (0.15). 0.40 is a better empirical
  # starting point on Arch-Router-1.5B; see the tuning note below.
  activation_threshold: 0.40

  # Used when no candidate covers the active label set, or the classifier
  # itself errors. Empty here = fail-fast with HTTP 500.
  fallback: qwen3-0.6b

  # The label vocabulary. Descriptions are fed verbatim into the
  # classifier's system prompt; short, action-oriented sentences work
  # best ("writing or debugging code", "small talk").
  policies:
    - label: code-generation
      description: writing, debugging, reading, or explaining code in any programming language
    - label: casual-chat
      description: small talk, greetings, jokes, or general conversation with no specific task
    - label: math-reasoning
      description: arithmetic, equations, percentage calculations, or step-by-step word problems

  # Routing table; order matters (smallest → largest). See "Score
  # classifier" above for the matching rule.
  candidates:
    - model: qwen3-0.6b
      labels: [casual-chat]
    - model: qwen_qwen3.5-2b
      labels: [code-generation, casual-chat, math-reasoning]
```
### Tuning `activation_threshold`
The threshold is the single knob you'll want to tune per
(classifier-model, policy-set) pair. On Arch-Router-1.5B with the
three-policy setup above, sweeping the threshold over a hand-labeled
30-prompt corpus produced:
| Threshold | Label-set accuracy | End-to-end routing accuracy |
|---:|---:|---:|
| 0.15 (package default) | 30% | 73% |
| 0.30 | 57% | 87% |
| **0.40** | **60%** | **90%** |
| 0.45 | 67% | 97% |
| 0.50 | 67% | 97% |
The classifier's argmax matches the dominant label 93% of the time on
this corpus — what the threshold controls is how much secondary-label
noise leaks into the active label set. Low thresholds push single-label
queries to multi-label-capable (larger) candidates unnecessarily; 0.40
keeps the dominant label dominant without losing genuine compound
activations.
Re-tune per (classifier-model, policy-set) pair. The `/api/score`
endpoint (see below) is the convenient probe — it returns the raw
length-normalized log-probabilities so you can sweep thresholds offline
without driving real chat completions.
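
An offline sweep might look like the sketch below. It assumes you have already collected, for each hand-labeled prompt, the per-label log-probabilities returned by `/api/score` together with the label set you expect; the `sweep` helper and the three-row corpus are invented for illustration and are not the corpus behind the table above.

```python
import math


def label_probs(logprobs):
    """Softmax over length-normalized log-probabilities."""
    top = max(logprobs.values())
    weights = {label: math.exp(lp - top) for label, lp in logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def sweep(corpus, thresholds):
    """corpus: list of (logprobs, expected_label_set). Returns {threshold: accuracy}."""
    results = {}
    for threshold in thresholds:
        hits = sum(
            {label for label, p in label_probs(logprobs).items() if p >= threshold} == expected
            for logprobs, expected in corpus
        )
        results[threshold] = hits / len(corpus)
    return results


corpus = [  # illustrative values only
    ({"a": -0.1, "b": -3.0}, {"a"}),
    ({"a": -0.5, "b": -0.6}, {"a", "b"}),
    ({"a": -0.2, "b": -1.6}, {"a"}),
]
for threshold, accuracy in sweep(corpus, [0.15, 0.40, 0.50]).items():
    print(f"{threshold:.2f}  label-set accuracy {accuracy:.0%}")
```

On this toy corpus the low threshold lets a secondary label leak into the third prompt, and the high one drops a genuine compound activation in the second, which is the trade-off the threshold controls.
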
### Embedding cache (L2)
Classification is the most expensive thing the middleware does. The
score classifier already memo-caches verbatim repeats (case- and
whitespace-folded prompt → decision); the **embedding cache** is the
L2 tier that catches *semantically similar* prompts — "How do I exit
vim?" and "i need to quit vim" can share a decision instead of running
the classifier twice.
The cache pairs naturally with a larger / slower classifier model: the steady-state
cost on cache hits collapses to one embedding round-trip plus a KNN
search, both well under 100ms with `nomic-embed-text-v1.5` + local-store.
#### Configuration
Add an `embedding_cache:` block to a router model:
```yaml
router:
  classifier: score
  classifier_model: arch-router-1.5b
  policies: [...]
  candidates: [...]
  embedding_cache:
    embedding_model: nomic-embed-text-v1.5     # any loaded embedding model
    similarity_threshold: 0.80                 # cosine sim floor for a hit (default 0.80)
    confidence_threshold: 0.60                 # min top-label prob to cache a decision (default 0.60)
    # store_name: router-cache-smart-router    # optional override; defaults to "router-cache-<router>"
```
Omit the block entirely to disable. The cache adds two new failure modes
(embedder unavailable, store unavailable) — both fall through to the
inner classifier so routing keeps working.
#### How it works
For each request:
1. Embed the probe prompt via the configured `embedding_model`.
2. KNN top-1 against the per-router local-store collection.
3. If similarity ≥ `similarity_threshold`, return the cached decision
(`Cached=true`, `CacheSimilarity=<sim>` in the decision log).
4. Miss → run the inner classifier. If `decision.score >= confidence_threshold`,
insert `(embedding, decision)` into the store. Low-confidence
decisions are deliberately skipped so they can't poison future
paraphrases.
The local-store collection is named `router-cache-<router-model-name>` by
default — each router gets its own collection so two routers can't
cross-contaminate. Collections persist on disk (local-store is the
canonical persistent vector backend), so the cache survives restarts.
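
The lookup-then-insert flow can be sketched with an in-memory list standing in for the local-store collection. Everything here is illustrative: the `EmbeddingCache` class, the `embed` and `classify` callables, and the decision dictionary shape are hypothetical, and a linear scan replaces the real KNN search.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EmbeddingCache:
    def __init__(self, embed, classify, similarity_threshold=0.80, confidence_threshold=0.60):
        self.embed = embed          # prompt -> vector
        self.classify = classify    # prompt -> {"model": ..., "score": ...}
        self.similarity_threshold = similarity_threshold
        self.confidence_threshold = confidence_threshold
        self.entries = []           # (embedding, decision) pairs

    def route(self, prompt):
        vec = self.embed(prompt)
        # Top-1 nearest neighbour by cosine similarity.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None:
            sim = cosine(vec, best[0])
            if sim >= self.similarity_threshold:
                return {**best[1], "cached": True, "cache_similarity": sim}
        decision = self.classify(prompt)  # miss: run the inner classifier
        if decision["score"] >= self.confidence_threshold:
            self.entries.append((vec, decision))  # low-confidence decisions are skipped
        return {**decision, "cached": False}


toy_vectors = {"exit vim": [1.0, 0.0], "quit vim": [0.9, 0.1]}
cache = EmbeddingCache(toy_vectors.__getitem__, lambda p: {"model": "qwen3-0.6b", "score": 0.9})
print(cache.route("exit vim")["cached"], cache.route("quit vim")["cached"])  # False True
```
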
#### Tuning notes
- **Similarity threshold**: 0.80 is the package default — re-tune
per (embedding model, corpus). The histogram on the Routing tab
shows where the cosine distribution actually sits; pick a
threshold above the cross-intent cluster and below the paraphrase
cluster.
- **Confidence threshold**: 0.60 corresponds roughly to "the
classifier is committed to a top label." Don't lower this — caching
unsure decisions propagates the uncertainty.
- **Cache flush**: the in-memory classifier cache invalidates automatically
when the router YAML changes (it is fingerprinted by `yaml.Marshal`),
but the underlying local-store collection still holds the old
payloads. For a hard reset, flush it manually via local-store admin or
rename `store_name`.
- **Latency budget**: an embedding round-trip (typically 30–80ms for
small embedding models) plus KNN search (~5ms) is added to every
*miss* on top of the classifier latency. Cache hits skip the
classifier entirely. Break-even is around 7–10% hit rate; agent
loops with repeated phrasing easily exceed this.
### Admin page
The `/app/middleware` page has a **Routing** tab listing every router
model's classifier, policies, candidates, and fallback. The **Events**
tab shows the decision log — one row per classified request with
correlation ID, requested model, served model, classifier name, active
labels, top-label score, and latency.
Routing decisions are stored in an in-process ring buffer (default
capacity 5,000). The decision log is for audit and tuning — the
canonical usage log lives in `/api/usage` and correlates by request ID.
### REST surface
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | `/api/router/status` | any | Router configuration: each router model's classifier, policies, candidates. |
| GET | `/api/router/decisions` | admin | Decision log with optional filters (`correlation_id`, `user_id`, `router_model`, `limit`). |
| POST | `/api/score` | admin | Direct access to the `Score` gRPC primitive — useful for offline threshold tuning. Body: `{"model": "<classifier-model>", "prompt": "<chatml-prompt>", "candidates": ["label-a", ...], "length_normalize": true}`. The llama-cpp and vLLM backends implement Score; other backends return `UNIMPLEMENTED`. |
### MCP tools
| Tool | Read/Write | Purpose |
|---|---|---|
| `get_router_decisions` | read | Recent decision log with optional filters. |
| `get_middleware_status` | read | Includes the router section listing configured router models. |
Mutating routing config — adding a candidate, changing the classifier
model — is YAML-only today; reload with `POST /models/reload` to pick
up edits without restarting.
### Operational notes
- **Reload after YAML edits.** The router configs are loaded at startup
and cached. `POST /models/reload` re-reads from disk; the next request
rebuilds the classifier from the new config (the classifier cache is
fingerprinted by `yaml.Marshal(RouterConfig)` so it invalidates
automatically).
- **Classifier latency** on Arch-Router-1.5B Q4_K_M is ~500ms steady
for 3 policies on Intel SYCL. The score primitive re-decodes the full
prompt for every candidate today (the KV cache is cleared between
candidates); the prompt-KV-sharing optimization is on the perf TODO
list in `backend/cpp/llama-cpp/grpc-server.cpp::Score`. Until then,
`classifier_cache_size` is the highest-leverage knob for repeat-query
workloads (agent loops).
- **Decision log size**: 5,000-entry ring buffer per process. The
log is in-process and not persisted — pair with the usage log for
long-horizon audit.
---
## Related features
- [Cloud passthrough proxy]({{< relref "cloud-proxy.md" >}}) — combine
the router with `proxy-*` backends to send simple prompts to local
models and complex ones to cloud providers.
- [MITM proxy]({{< relref "mitm-proxy.md" >}}) — apply the same PII
filter to Claude Code, Codex CLI, and any HTTPS client without
LocalAI holding their API keys.
- [Authentication]({{< relref "authentication.md" >}}) — admin role is
required for mutating endpoints and the `/app/middleware` page; in
no-auth single-user mode the synthetic local user has admin role
automatically.