LocalAI/tests/e2e/e2e_router_test.go
Richard Palethorpe 085fc53bbc fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104)
* fix(router): score classifier production-readiness

Conversation trimming runs through the classifier model's chat template
and trims by exact token count, sized to the model's n_batch which is
now scaled to context so long probes can't crash the backend. Missing
chat_message templates are a hard error at router build time. Router-
facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve
ModelConfig per call so a model installed post-startup doesn't bind a
stub Backend="" config and silently fall into the loader's auto-
iterate path.
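
A minimal sketch of the newest-first trimming idea described above. It assumes a hypothetical trimToBudget helper and a rough four-characters-per-token counter (the same approximation the mock backend in the e2e test uses). The real router counts tokens through the classifier model's chat template, and none of the names below come from the LocalAI code.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical stand-in for a chat message.
type Message struct {
	Role    string
	Content string
}

// approxTokens is an assumed counter: about one token per four
// characters, rounded up. A real implementation would ask the model.
func approxTokens(s string) int {
	return (len(s) + 3) / 4
}

// trimToBudget walks from the newest message backwards and keeps as many
// messages as fit within budget. The newest message is always kept, even
// if it alone exceeds the budget, because it carries the routing intent.
func trimToBudget(msgs []Message, budget int) []Message {
	if len(msgs) == 0 {
		return nil
	}
	start := len(msgs) - 1
	used := approxTokens(msgs[start].Content)
	for i := len(msgs) - 2; i >= 0; i-- {
		cost := approxTokens(msgs[i].Content)
		if used+cost > budget {
			break
		}
		used += cost
		start = i
	}
	return msgs[start:]
}

func main() {
	filler := strings.Repeat("x", 400) // about 100 tokens under the assumed counter
	msgs := []Message{
		{Role: "user", Content: filler},
		{Role: "user", Content: filler},
		{Role: "user", Content: "ROUTE_HINT=code-generation"},
	}
	kept := trimToBudget(msgs, 110)
	fmt.Println(len(kept), kept[len(kept)-1].Content)
}
```

Dropping the oldest turns first means the newest turn always survives, which is the property the long-conversation e2e spec checks.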

New 'vector_store' backend trace recorded inside localVectorStore on
every Search/Insert — including the backend-load-failure path that
previously vanished into an xlog.Warn — with outcome tagging
(hit/miss/empty_store/backend_load_error/find_error/insert_error/ok).
Companion cleanup drops misleading similarity:0 and input_tokens_count:0
from non-hit and text-mode traces.

Gallery local-store-development aliases to 'local-store' so the master
image satisfies pkg/model.LocalStoreBackend lookups from the embedding
cache.

Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key
(the original bug); ModelTokenize nil-guard; non-fatal mitm proxy
startup; PII 'route_local' renamed to 'allow' with docs/UI in sync;
model-editor footer no longer eats the edit area on small screens;
several config-editor template/dropdown/section fixes.

Tests: e2e router specs (casual/code-hint + long-conversation trim),
vector_store trace specs, lazy-factory specs, gallery dev-alias
resolution, Playwright trace badge + scroll regression.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(backend): auto-size batch to context for embedding and rerank models

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.

Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.
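
The sizing rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual EffectiveBatchSize signature: the Usecase type, its constants, and the effectiveBatchSize helper are hypothetical. Only the 512 default, the "explicit batch wins" rule, and the embedding/rerank/score usecases come from the commit message.

```go
package main

import "fmt"

// Usecase is a hypothetical flag type standing in for the real usecase flags.
type Usecase int

const (
	UsecaseChat Usecase = iota
	UsecaseEmbedding
	UsecaseRerank
	UsecaseScore
)

// defaultBatch is the 512 default mentioned in the commit message.
const defaultBatch = 512

// effectiveBatchSize returns the batch size a backend would be started
// with under the assumed rule: an explicit batch always wins; single-pass
// usecases (embedding, rerank, score) get a batch as large as the context;
// everything else keeps the default.
func effectiveBatchSize(explicitBatch, contextSize int, uc Usecase) int {
	if explicitBatch > 0 {
		return explicitBatch
	}
	switch uc {
	case UsecaseEmbedding, UsecaseRerank, UsecaseScore:
		if contextSize > 0 {
			return contextSize
		}
	}
	return defaultBatch
}

func main() {
	fmt.Println(effectiveBatchSize(0, 2048, UsecaseEmbedding)) // sized to context
	fmt.Println(effectiveBatchSize(0, 2048, UsecaseChat))      // default
	fmt.Println(effectiveBatchSize(256, 2048, UsecaseRerank))  // explicit batch wins
}
```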

Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(gallery): raise arch-router scoring output cap via parallel:64

Scoring decodes the whole prompt+candidate in a single llama_decode and
reads one logit row per candidate token. The vendored llama.cpp server
caps causal output rows at n_parallel, so the default of 1 aborts with
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route
labels. Set options: [parallel:64] on both arch-router quant entries to
lift the cap; kv_unified (the grpc-server default) keeps the full context
per sequence, so this does not split the KV cache.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-12 16:21:15 +02:00

91 lines
4.0 KiB
Go
package e2e_test

import (
	"context"
	"strings"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	"github.com/openai/openai-go/v3"
)

// Router e2e: drives /v1/chat/completions through the RouteModel middleware
// against a configured score classifier (mock-classifier from the suite
// fixtures) and two candidates. The mock-backend's Score handler ranks
// candidates by looking for a `ROUTE_HINT=<label>` marker in the prompt and
// boosting the candidate whose label matches; without a hint, all candidates
// score equally and the router falls back. The ECHO_SERVED_MODEL trigger
// makes the chosen candidate echo its loaded model file path so the test can
// verify routing decisively rather than infer it from content shape.
var _ = Describe("Router E2E", Label("Router"), func() {
	chat := func(message string) (*openai.ChatCompletion, error) {
		return client.Chat.Completions.New(
			context.TODO(),
			openai.ChatCompletionNewParams{
				Model: "smart-router",
				Messages: []openai.ChatCompletionMessageParamUnion{
					openai.UserMessage(message),
				},
			},
		)
	}

	It("routes a casual probe to the casual-chat candidate", func() {
		resp, err := chat("ROUTE_HINT=casual-chat ECHO_SERVED_MODEL")
		Expect(err).ToNot(HaveOccurred())
		Expect(resp.Choices).To(HaveLen(1))
		Expect(resp.Choices[0].Message.Content).To(ContainSubstring("SERVED_MODEL=mock-cand-casual.bin"),
			"casual hint should have routed to mock-cand-casual; got %q", resp.Choices[0].Message.Content)
	})

	It("routes a code probe to the code-generation candidate", func() {
		resp, err := chat("ROUTE_HINT=code-generation ECHO_SERVED_MODEL")
		Expect(err).ToNot(HaveOccurred())
		Expect(resp.Choices).To(HaveLen(1))
		Expect(resp.Choices[0].Message.Content).To(ContainSubstring("SERVED_MODEL=mock-cand-code.bin"),
			"code hint should have routed to mock-cand-code; got %q", resp.Choices[0].Message.Content)
	})

	It("falls back when no policy label matches the probe", func() {
		// No ROUTE_HINT marker: the mock Score handler gives every candidate
		// the same base log-prob, softmax goes uniform, no label clears
		// activation_threshold=0.40, so the router falls back to
		// mock-cand-casual.
		resp, err := chat("ECHO_SERVED_MODEL hello world")
		Expect(err).ToNot(HaveOccurred())
		Expect(resp.Choices).To(HaveLen(1))
		Expect(resp.Choices[0].Message.Content).To(ContainSubstring("SERVED_MODEL=mock-cand-casual.bin"),
			"unhinted probe should have fallen back; got %q", resp.Choices[0].Message.Content)
	})

	It("routes correctly over a long conversation (exercises fitMessages)", func() {
		// Build a conversation long enough that the score classifier's
		// probeTokenBudget kicks in and fitMessages has to trim. mock-backend's
		// TokenizeString returns ~1 token per 4 prompt characters, and the
		// classifier ContextSize is 4096, so >40k chars guarantees the trim
		// path. The ROUTE_HINT marker is placed ONLY in the newest message;
		// if fitMessages dropped it during trim, no candidate would win and we
		// would route to the fallback (mock-cand-casual) instead of the code
		// candidate.
		filler := strings.Repeat("background context, lorem ipsum dolor sit amet. ", 200) // ~10k chars × 5 turns
		msgs := make([]openai.ChatCompletionMessageParamUnion, 0, 6)
		for range 5 {
			msgs = append(msgs, openai.UserMessage(filler))
		}
		msgs = append(msgs, openai.UserMessage("ROUTE_HINT=code-generation ECHO_SERVED_MODEL"))

		resp, err := client.Chat.Completions.New(
			context.TODO(),
			openai.ChatCompletionNewParams{Model: "smart-router", Messages: msgs},
		)
		Expect(err).ToNot(HaveOccurred(), "router must survive a long conversation without erroring")
		Expect(resp.Choices).To(HaveLen(1))

		// The newest turn carries the routing intent ("code"); fitMessages must
		// keep it intact even after dropping older fillers, so the code
		// candidate still wins.
		Expect(resp.Choices[0].Message.Content).To(ContainSubstring("SERVED_MODEL=mock-cand-code.bin"),
			"long-conversation routing must still resolve to the code candidate; got %q",
			resp.Choices[0].Message.Content)
	})
})