+++ title = "Middleware: PII filtering and intelligent routing" weight = 27 toc = true description = "Per-model PII redaction and policy-based request routing" tags = ["Routing", "Privacy", "PII", "Middleware", "Advanced"] categories = ["Features"] +++ LocalAI ships a request-middleware layer that sits between the HTTP API and the backend dispatcher. Two subsystems share that layer because they share the same lifecycle hook: **PII filtering** scans the request body before it reaches a backend (and the SSE stream on the way out), and the **intelligent router** rewrites `input.Model` so a single client-facing model name fans out across multiple downstream targets. Both are inspected and configured from the same admin page (`/app/middleware`), backed by the same REST surface (`/api/middleware/*`, `/api/pii/*`, `/api/router/*`) and the same MCP tools. ## Request lifecycle ``` client ── auth ── route-model ── per-model PII ── backend ── streaming PII ── client │ │ └─── decision log └─── event log ``` The router runs first (it picks the target model so per-model PII has something to gate on), per-model PII runs next (gated by the resolved config), the backend executes, and the streaming PII filter rewrites the SSE response in flight. Each subsystem writes to its own admin-visible log: `/api/router/decisions` for routing, `/api/pii/events` for redaction and block actions. --- ## PII filtering PII redaction is **per-model and off by default**. The default flips to **on for any backend whose name starts with `proxy-`** because that traffic crosses the network to a third-party provider. Explicit `pii.enabled` in a model's YAML always wins over the backend default. ### Pattern catalog The built-in regex tier ships six patterns. Each has a default action (`mask`, `block`, or `route_local`) and a length cap that prevents pathological inputs from blowing up scanning time: | ID | Description | Default action | Max length | |---|---|---|---| | `email` | Email address | `mask` | 254 | | `phone` | Phone number (international or US) | `mask` | 24 | | `ssn` | US Social Security Number | `mask` | 11 | | `credit_card` | Credit card number (Luhn-verified) | `mask` | 19 | | `ipv4` | IPv4 address | `mask` | 15 | | `api_key_prefix` | `sk-`, `pk-`, `xoxb-`, `ghp_`, `github_pat_` | **`block`** | 200 | `mask` rewrites the match to `[REDACTED:]` in the request body before forwarding. `block` returns HTTP 400 with `error.type=pii_blocked` to the client without forwarding. `route_local` is reserved for the routing integration (see below) and falls back to `mask` when no local route is available. ### Per-model configuration Add a `pii:` block to a model YAML to opt in (or out, or to override per-pattern actions): ```yaml # Local model — explicit opt-in so chats with this model get redaction # applied request-side. name: qwen-7b-local backend: llama-cpp pii: enabled: true ``` ```yaml # Cloud-bound model — defaults to enabled because backend is cloud-proxy. # Tighten api_key_prefix from the global default and downgrade email to # route_local so emails route to a local model rather than leaving the # network. name: claude-strict backend: cloud-proxy proxy: mode: passthrough provider: anthropic upstream_url: https://api.anthropic.com/v1/messages api_key_env: ANTHROPIC_API_KEY pii: patterns: - id: api_key_prefix action: block # already the default, made explicit for audit - id: email action: route_local ``` The regex itself stays global — only the action is settable per-model. Adding new patterns is a build-time concern (extend `patternRegexps` in `core/services/routing/pii/patterns.go`). ### NER tier (optional) The regex matcher covers high-precision patterns. For natural-language PII (proper names, addresses, organization names) LocalAI carries an **encoder NER tier** that runs after the regex pass. It expects a transformers token-classification model wired through the `TokenClassify` gRPC primitive (e.g. `dslim/bert-base-NER`). The detector annotates spans with an entity group (`PER`, `LOC`, `ORG`, `MISC`); per-group actions are configurable through the same `pii:` block. The NER tier ships as a contract (`NERDetector`, `NERConfig` in `core/services/routing/pii/ner.go`); an operator-facing knob to load and attach a detector is not plumbed yet. When no detector is configured the regex tier still runs. ### Streaming PII filter Buffered (`/v1/chat/completions` without `"stream": true`) responses are forwarded verbatim today — only the request-side scan runs. Streaming responses run through `pii.StreamFilter` which buffers SSE chunks until either a full pattern matches or the buffer's max length is reached, then emits the safe prefix. The streaming filter is what makes the cloud-proxy backend and the MITM proxy safe to expose to clients that issue streaming requests. The streaming filter is wired automatically for any model with `pii.enabled` true — there is no separate streaming toggle. ### Admin page The `/app/middleware` page (admin role only) has four tabs — **Filtering**, **Routing**, **MITM Proxy** (see the [MITM doc]({{< relref "mitm-proxy.md" >}})), and **Events**. The Filtering tab shows: - The pattern catalogue with live action dropdowns. Changing an action via the UI calls `PUT /api/pii/patterns/:id` and updates the live redactor in-process. Click **Persist** in the action header to write the current state into `runtime_settings.json` so the next process start re-applies it. - A per-model resolved-state table — each model row reports `enabled`, the per-pattern overrides, and which patterns are effectively active. - A live test panel that posts sample text to `/api/pii/test` and highlights matches with their resolved actions, without storing the text in the event log. ### REST surface | Method | Path | Auth | Purpose | |---|---|---|---| | GET | `/api/pii/patterns` | any | Live pattern list with current actions. Used by the UI catalogue. | | POST | `/api/pii/test` | any | Dry-run the redactor on `{"text":"..."}`. Returns hits and the would-be-rewritten body. Does not write to the event log. | | GET | `/api/pii/events` | admin | Recent middleware events — PII redactions, MITM connect/traffic, admission denials. Filterable by `correlation_id`, `user_id`, `pattern_id`, `kind`. | | PUT | `/api/pii/patterns/:id` | admin | Update a pattern in-process. Body accepts `{"action":"mask"\|"block"\|"route_local"}` and/or `{"disabled":true\|false}`. Transient — reverts on restart unless persisted. | | POST | `/api/pii/patterns/persist` | admin | Snapshot the live per-pattern (action, disabled) state into `runtime_settings.json`. | | GET | `/api/middleware/status` | admin | Aggregated dashboard data: patterns + per-model resolved state + router status + MITM status + admission status. One round-trip for the UI. | ### MCP tools The same surface is mirrored through the LocalAI Assistant MCP server so the in-process and stdio assistants can manage the filter conversationally: | Tool | Read/Write | Purpose | |---|---|---| | `list_pii_patterns` | read | Returns the live pattern list. | | `get_pii_events` | read | Recent redaction / block events with optional filters. | | `test_pii_redaction` | read | Dry-run sample text without writing to the event log. | | `get_middleware_status` | read | Aggregator — the same payload as `GET /api/middleware/status`. | | `set_pii_pattern_action` | write | Update a pattern's action. Admin-only. | | `persist_pii_patterns` | write | Snapshot live state to `runtime_settings.json`. Admin-only. | --- ## Intelligent routing A **router model** is a model whose YAML carries a `router:` block. When a client addresses it (`"model": "smart-router"`), the middleware classifies the prompt, picks a downstream candidate model, rewrites `input.Model` to the candidate, and the standard model-resolution path runs against that resolved target. ACL checks, disabled-state, and per-model PII all apply to the resolved model — the router does *model selection only*. #### Depth-1 invariant Candidates **must not** themselves be router models. A `smart-router → claude-strict → cloud-proxy` chain is fine (`claude-strict` is a regular cloud-proxy model). A `smart-router → other-router → real-model` chain is rejected at runtime by the middleware (the dispatcher returns HTTP 500 with a `depth-1 invariant` error). This keeps the dispatch graph acyclic and predictable. #### Fallback If no candidate's label set covers the active label set from the classifier, or the classifier errors out, the router uses `cfg.Router.Fallback`. An empty `fallback` causes the dispatch to fail with HTTP 500 rather than silently routing somewhere unintended — fail-fast, not silent-bypass. ### Available classifiers LocalAI ships two classifier implementations. Pick one with `classifier:` in the router YAML: | Classifier | When to use | Underlying primitive | |---|---|---| | `score` (default) | Small classifier-tuned LM (Arch-Router-style). Best when label vocabulary is well-covered by next-token continuation. | `Score` gRPC primitive (llama-cpp, vLLM). | | `colbert` | When label descriptions are abstract or short and a next-token classifier produces flat distributions. Robust on long-form policy descriptions. | rerankers backend in ColBERT mode (e.g. `bge-m3-colbert` from the gallery). | Both classifiers share the same YAML shape: `classifier_model`, `policies`, `candidates`, `fallback`, `activation_threshold`, `classifier_cache_size`, and the optional `embedding_cache` block. ### The Score classifier The `score` classifier works like this: 1. Build a Qwen/ChatML system prompt that lists every policy label with its description and primes the model to emit a label as the assistant turn. 2. Ask the classifier model to **score every policy label** as the first-token(s) continuation. This uses the `Score` gRPC primitive (`backend.proto::Score`), which returns per-candidate log-probabilities length-normalized so candidates of unequal token length stay comparable. 3. Softmax the length-normalized log-probabilities into a probability distribution over labels. 4. Threshold the distribution: every label whose probability passes `activation_threshold` joins the **active label set**. 5. Pick the FIRST candidate whose `Labels` is a superset of the active set. Admins order candidates smallest → largest so a single-label query routes to the smallest capable model, while a query that activates multiple labels falls to a candidate that covers them all. This is the Arch-Router approach extended for multi-label. The distribution carries more signal than the argmax — reading off the spread lets one prompt activate multiple policies and route to a model capable of all of them. #### Recommended classifier model [Arch-Router-1.5B](https://huggingface.co/katanemo/Arch-Router-1.5B) is the canonical choice. It's a Qwen-2.5-1.5B-Instruct base trained specifically on routing-policy continuation, so the ChatML system-prompt + label-continuation pattern produces well-separated label probabilities without prompt tuning. The Q4_K_M GGUF runs on CPU, GPU, and Intel SYCL. The classifier model must support the `Score` gRPC primitive (today: the llama-cpp and vLLM backends) and use the ChatML chat template. Any small ChatML instruct model works under those constraints, but expect flatter probability distributions which translate to a higher `activation_threshold` to keep noise out of the active label set. On llama-cpp, declare `known_usecases: [score]` on the classifier model — LocalAI rejects configs that combine `score` with `chat`/`completion`/`embeddings` there, because the Score RPC races the `llama_context` against slot-loop traffic. ### The Colbert classifier The `colbert` classifier reranks each policy *description* against the prompt via the rerankers backend and activates the labels whose relevance scores clear `activation_threshold` (default 0.5 for reranker-style scores in [0, 1]). ```yaml router: classifier: colbert classifier_model: bge-m3-colbert # gallery entry; loads BAAI/bge-m3 in ColBERT mode activation_threshold: 0.5 policies: - label: code-generation description: writing, debugging, reading, or explaining code - label: casual-chat description: small talk, greetings, jokes candidates: [...] ``` The reranker scores the *description* (natural English) rather than asking a small LM to score the *label* as a next-token continuation, so it tends to be more robust when policy labels are abstract slugs (`compliance-review`, `tier-2-support`). The trade-off is one reranker round-trip per request — bge-m3 in ColBERT mode is fast enough on GPU that this is comparable to the Score path for most workloads. The `embedding_cache` block applies identically. The reranker model's `type:` (in the model YAML) selects which underlying scoring head loads — `colbert` for late-interaction MaxSim, `cross-encoder` for cross-attention scoring. The classifier itself is indifferent; pick the head that fits your latency / quality budget. ### YAML reference ```yaml name: smart-router known_usecases: - chat router: # `score` (Arch-Router-style next-token scoring) or `colbert` # (rerank policy descriptions). See "Available classifiers" above. classifier: score # A model loaded by LocalAI that supports the Score gRPC primitive # (llama-cpp and vLLM ship implementations). Arch-Router-1.5B is the # canonical choice. classifier_model: arch-router-1.5b # Bounded LRU keyed on (case-folded, whitespace-trimmed) prompt — prompts # repeat in agent loops; the cache amortises the classifier round-trip # across them. 0 here means "use the default" (1024); the cache cannot be # disabled from YAML today. classifier_cache_size: 256 # Softmax probability floor a label must clear to join the active label set. # 0 = use the package default (0.15). 0.40 is a better empirical # starting point on Arch-Router-1.5B — see the tuning note below. activation_threshold: 0.40 # Used when no candidate covers the active label set, or the classifier # itself errors. Empty here = fail-fast with HTTP 500. fallback: qwen3-0.6b # The label vocabulary. Descriptions are fed verbatim into the # classifier's system prompt — short, action-oriented sentences work # best ("writing or debugging code", "small talk"). policies: - label: code-generation description: writing, debugging, reading, or explaining code in any programming language - label: casual-chat description: small talk, greetings, jokes, or general conversation with no specific task - label: math-reasoning description: arithmetic, equations, percentage calculations, or step-by-step word problems # Routing table — order matters (smallest → largest). See "Score # classifier" above for the matching rule. candidates: - model: qwen3-0.6b labels: [casual-chat] - model: qwen_qwen3.5-2b labels: [code-generation, casual-chat, math-reasoning] ``` ### Tuning `activation_threshold` The threshold is the single knob you'll want to tune per (classifier-model, policy-set) pair. On Arch-Router-1.5B with the three-policy setup above, sweeping the threshold over a hand-labeled 30-prompt corpus produced: | Threshold | Label-set accuracy | End-to-end routing accuracy | |---:|---:|---:| | 0.15 (package default) | 30% | 73% | | 0.30 | 57% | 87% | | **0.40** | **60%** | **90%** | | 0.45 | 67% | 97% | | 0.50 | 67% | 97% | The classifier's argmax matches the dominant label 93% of the time on this corpus — what the threshold controls is how much secondary-label noise leaks into the active label set. Low thresholds push single-label queries to multi-label-capable (larger) candidates unnecessarily; 0.40 keeps the dominant label dominant without losing genuine compound activations. Re-tune per (classifier-model, policy-set) pair. The `/api/score` endpoint (see below) is the convenient probe — it returns the raw length-normalized log-probabilities so you can sweep thresholds offline without driving real chat completions. ### Embedding cache (L2) Classification is the most expensive thing the middleware does. The score classifier already memo-caches verbatim repeats (case- and whitespace-folded prompt → decision); the **embedding cache** is the L2 tier that catches *semantically similar* prompts — "How do I exit vim?" and "i need to quit vim" can share a decision instead of running the classifier twice. Pairs naturally with a larger / slower classifier model: the steady-state cost on cache hits collapses to one embedding round-trip plus a KNN search, both well under 100ms with `nomic-embed-text-v1.5` + local-store. #### Configuration Add an `embedding_cache:` block to a router model: ```yaml router: classifier: score classifier_model: arch-router-1.5b policies: [...] candidates: [...] embedding_cache: embedding_model: nomic-embed-text-v1.5 # any loaded embedding model similarity_threshold: 0.80 # cosine sim floor for a hit (default 0.80) confidence_threshold: 0.60 # min top-label prob to cache a decision (default 0.60) # store_name: router-cache-smart-router # optional override; defaults to "router-cache-" ``` Omit the block entirely to disable. The cache adds two new failure modes (embedder unavailable, store unavailable) — both fall through to the inner classifier so routing keeps working. #### How it works For each request: 1. Embed the probe prompt via the configured `embedding_model`. 2. KNN top-1 against the per-router local-store collection. 3. If similarity ≥ `similarity_threshold`, return the cached decision (`Cached=true`, `CacheSimilarity=` in the decision log). 4. Miss → run the inner classifier. If `decision.score >= confidence_threshold`, insert `(embedding, decision)` into the store. Low-confidence decisions are deliberately skipped so they can't poison future paraphrases. The local-store collection is named `router-cache-` by default — each router gets its own collection so two routers can't cross-contaminate. Collections persist on disk (local-store is the canonical persistent vector backend), so the cache survives restarts. #### Tuning notes - **Similarity threshold**: 0.80 is the package default — re-tune per (embedding model, corpus). The histogram on the Routing tab shows where the cosine distribution actually sits; pick a threshold above the cross-intent cluster and below the paraphrase cluster. - **Confidence threshold**: 0.60 corresponds roughly to "the classifier is committed to a top label." Don't lower this — caching unsure decisions propagates the uncertainty. - **Cache flush**: invalidates automatically when the router YAML changes (the classifier cache is fingerprinted by `yaml.Marshal`), but the underlying local-store collection still holds the old payloads. Manual flush via local-store admin or by renaming `store_name` if you need a hard reset. - **Latency budget**: an embedding round-trip (typically 30–80ms for small embedding models) plus KNN search (~5ms) is added to every *miss* on top of the classifier latency. Cache hits skip the classifier entirely. Break-even is around 7–10% hit rate; agent loops with repeated phrasing easily exceed this. ### Admin page The `/app/middleware` page has a **Routing** tab listing every router model's classifier, policies, candidates, and fallback. The **Events** tab shows the decision log — one row per classified request with correlation ID, requested model, served model, classifier name, active labels, top-label score, and latency. Routing decisions are stored in an in-process ring buffer (default capacity 5,000). The decision log is for audit and tuning — the canonical usage log lives in `/api/usage` and correlates by request ID. ### REST surface | Method | Path | Auth | Purpose | |---|---|---|---| | GET | `/api/router/status` | any | Router configuration: each router model's classifier, policies, candidates. | | GET | `/api/router/decisions` | admin | Decision log with optional filters (`correlation_id`, `user_id`, `router_model`, `limit`). | | POST | `/api/score` | admin | Direct access to the `Score` gRPC primitive — useful for offline threshold tuning. Body: `{"model": "", "prompt": "", "candidates": ["label-a", ...], "length_normalize": true}`. The llama-cpp and vLLM backends implement Score; other backends return `UNIMPLEMENTED`. | ### MCP tools | Tool | Read/Write | Purpose | |---|---|---| | `get_router_decisions` | read | Recent decision log with optional filters. | | `get_middleware_status` | read | Includes the router section listing configured router models. | Mutating routing config — adding a candidate, changing the classifier model — is YAML-only today; reload with `POST /models/reload` to pick up edits without restarting. ### Operational notes - **Reload after YAML edits.** The router configs are loaded at startup and cached. `POST /models/reload` re-reads from disk; the next request rebuilds the classifier from the new config (the classifier cache is fingerprinted by `yaml.Marshal(RouterConfig)` so it invalidates automatically). - **Classifier latency** on Arch-Router-1.5B Q4_K_M is ~500ms steady for 3 policies on Intel SYCL. The score primitive re-decodes the full prompt for every candidate today (the KV cache is cleared between candidates); the prompt-KV-sharing optimization is on the perf TODO list in `backend/cpp/llama-cpp/grpc-server.cpp::Score`. Until then, `classifier_cache_size` is the highest-leverage knob for repeat-query workloads (agent loops). - **Decision log size**: 5,000-entry ring buffer per process. The log is in-process and not persisted — pair with the usage log for long-horizon audit. --- ## Related features - [Cloud passthrough proxy]({{< relref "cloud-proxy.md" >}}) — combine the router with `proxy-*` backends to send simple prompts to local models and complex ones to cloud providers. - [MITM proxy]({{< relref "mitm-proxy.md" >}}) — apply the same PII filter to Claude Code, Codex CLI, and any HTTPS client without LocalAI holding their API keys. - [Authentication]({{< relref "authentication.md" >}}) — admin role is required for mutating endpoints and the `/app/middleware` page; in no-auth single-user mode the synthetic local user has admin role automatically.