mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-14 11:49:33 -04:00
* fix(router): score classifier production-readiness Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time. Router- facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto- iterate path. New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert — including the backend-load-failure path that previously vanished into an xlog.Warn — with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces. Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache. Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes. Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(backend): auto-size batch to context for embedding and rerank models Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins. Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse. Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(gallery): raise arch-router scoring output cap via parallel:64 Scoring decodes the whole prompt+candidate in a single llama_decode and reads one logit row per candidate token. The vendored llama.cpp server caps causal output rows at n_parallel, so the default of 1 aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route labels. Set options: [parallel:64] on both arch-router quant entries to lift the cap; kv_unified (the grpc-server default) keeps the full context per sequence, so this does not split the KV cache. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>
323 lines
16 KiB
Go
323 lines
16 KiB
Go
package localaitools
|
|
|
|
// DTOs for the LocalAIClient interface. Where the same shape already exists
|
|
// elsewhere (config.Gallery, gallery.Metadata, schema.KnownBackend,
|
|
// vram.EstimateResult) we surface that type directly via the interface
|
|
// instead of maintaining a parallel DTO. The remaining types in this file
|
|
// are LLM-shaped views of internal state where the source struct carries
|
|
// fields the LLM shouldn't see (auth tokens, filesystem paths) or
|
|
// non-JSON-friendly fields (e.g. galleryop.OpStatus.Error which marshals
|
|
// to "{}" because it's an interface).
|
|
|
|
// GallerySearchQuery is the input for gallery_search.
|
|
type GallerySearchQuery struct {
|
|
Query string `json:"query" jsonschema:"Free-text query matched against model name, gallery and tags. Empty returns the first Limit models."`
|
|
Limit int `json:"limit,omitempty" jsonschema:"Maximum number of results to return. Defaults to 20 when zero or negative."`
|
|
Tag string `json:"tag,omitempty" jsonschema:"Optional tag filter (e.g. chat, embed, image)."`
|
|
Gallery string `json:"gallery,omitempty" jsonschema:"Restrict results to a specific gallery name."`
|
|
}
|
|
|
|
// InstalledModel is one entry in list_installed_models. Distinct from
|
|
// config.ModelConfig (which is the full on-disk YAML — far too large to
|
|
// serialise per request); this is a summary the LLM can scan cheaply.
|
|
type InstalledModel struct {
|
|
Name string `json:"name"`
|
|
Backend string `json:"backend,omitempty"`
|
|
Capabilities []string `json:"capabilities,omitempty"`
|
|
Pinned bool `json:"pinned,omitempty"`
|
|
Disabled bool `json:"disabled,omitempty"`
|
|
}
|
|
|
|
// JobStatus is a JSON-friendly mirror of galleryop.OpStatus. We don't surface
|
|
// OpStatus directly because its `Error error` field marshals to `{}` (the
|
|
// json.Marshal default for an error interface), and the underlying status
|
|
// map keys jobs by UUID rather than carrying the ID on the value, so we
|
|
// add the ID here too. Keep field names aligned with OpStatus where they
|
|
// overlap so callers comparing the two don't have to translate.
|
|
type JobStatus struct {
|
|
ID string `json:"id"`
|
|
Processed bool `json:"processed"`
|
|
Cancelled bool `json:"cancelled,omitempty"`
|
|
Progress float64 `json:"progress"`
|
|
TotalFileSize string `json:"total_file_size,omitempty"`
|
|
DownloadedFileSize string `json:"downloaded_file_size,omitempty"`
|
|
Message string `json:"message,omitempty"`
|
|
ErrorMessage string `json:"error,omitempty"`
|
|
}
|
|
|
|
// ModelConfigView is a JSON view of a model config file.
|
|
type ModelConfigView struct {
|
|
Name string `json:"name"`
|
|
YAML string `json:"yaml,omitempty" jsonschema:"Full YAML serialization of the model config."`
|
|
JSON map[string]any `json:"json,omitempty" jsonschema:"Parsed JSON view of the same config (convenience for diffing)."`
|
|
}
|
|
|
|
// InstallModelRequest is the input for install_model.
|
|
type InstallModelRequest struct {
|
|
GalleryName string `json:"gallery_name,omitempty" jsonschema:"The gallery the model lives in (from gallery_search). Optional when ModelName is unique across galleries."`
|
|
ModelName string `json:"model_name" jsonschema:"The canonical model name as returned by gallery_search."`
|
|
Overrides map[string]any `json:"overrides,omitempty" jsonschema:"Optional config overrides to merge into the installed model's YAML."`
|
|
}
|
|
|
|
// InstallBackendRequest is the input for install_backend.
|
|
type InstallBackendRequest struct {
|
|
GalleryName string `json:"gallery_name,omitempty" jsonschema:"Source backend gallery."`
|
|
BackendName string `json:"backend_name" jsonschema:"Backend identifier (e.g. llama-cpp)."`
|
|
}
|
|
|
|
// Backend is the LLM-facing summary returned by list_backends. We don't
|
|
// expose gallery.SystemBackend directly because it carries filesystem
|
|
// paths (RunFile, IsSystem, IsMeta, the full Metadata) the LLM doesn't
|
|
// need and the tokens add up. ListKnownBackends returns schema.KnownBackend
|
|
// directly — that one is already the canonical wire shape.
|
|
type Backend struct {
|
|
Name string `json:"name"`
|
|
Installed bool `json:"installed"`
|
|
}
|
|
|
|
// SystemInfo summarises the LocalAI deployment.
|
|
type SystemInfo struct {
|
|
Version string `json:"version"`
|
|
Distributed bool `json:"distributed"`
|
|
BackendsPath string `json:"backends_path,omitempty"`
|
|
ModelsPath string `json:"models_path,omitempty"`
|
|
LoadedModels []string `json:"loaded_models,omitempty"`
|
|
InstalledBackends []string `json:"installed_backends,omitempty"`
|
|
}
|
|
|
|
// Node is one entry in list_nodes.
|
|
type Node struct {
|
|
ID string `json:"id"`
|
|
Address string `json:"address,omitempty"`
|
|
HTTPAddress string `json:"http_address,omitempty"`
|
|
TotalVRAM uint64 `json:"total_vram,omitempty"`
|
|
Healthy bool `json:"healthy"`
|
|
LastSeen string `json:"last_seen,omitempty"`
|
|
}
|
|
|
|
// ImportModelURIRequest is the input for import_model_uri. It mirrors the
|
|
// REST surface (`/models/import-uri`) closely so both clients can produce
|
|
// identical responses; the BackendPreference is a flat field rather than the
|
|
// REST `preferences` JSON blob since the LLM only needs to specify a backend
|
|
// name when it disambiguates a multi-backend match.
|
|
type ImportModelURIRequest struct {
|
|
URI string `json:"uri" jsonschema:"The model source. Accepts HuggingFace URLs (https://huggingface.co/...), OCI image references, http(s) URLs to a manifest, file:// paths, or a bare HF repo (e.g. Qwen/Qwen3-4B-GGUF)."`
|
|
BackendPreference string `json:"backend_preference,omitempty" jsonschema:"Optional backend name (e.g. llama-cpp). Required as the second-step retry when a previous import_model_uri call returned ambiguous_backend=true."`
|
|
Overrides map[string]any `json:"overrides,omitempty" jsonschema:"Optional config overrides applied to the discovered model (e.g. context_size)."`
|
|
}
|
|
|
|
// ImportModelURIResponse is what import_model_uri returns. When
|
|
// AmbiguousBackend is true the LLM must surface the candidates to the user
|
|
// and call again with BackendPreference set; the JobID is empty in that case.
|
|
type ImportModelURIResponse struct {
|
|
JobID string `json:"job_id,omitempty"`
|
|
DiscoveredModelName string `json:"discovered_model_name,omitempty"`
|
|
AmbiguousBackend bool `json:"ambiguous_backend,omitempty"`
|
|
Modality string `json:"modality,omitempty"`
|
|
BackendCandidates []string `json:"backend_candidates,omitempty"`
|
|
Hint string `json:"hint,omitempty"`
|
|
}
|
|
|
|
// Branding is the LLM-facing view of the instance's whitelabel settings.
|
|
// Only the configurable text fields and the resolved asset URLs are
|
|
// surfaced — the backing filenames on disk stay an implementation detail.
|
|
type Branding struct {
|
|
InstanceName string `json:"instance_name"`
|
|
InstanceTagline string `json:"instance_tagline"`
|
|
LogoURL string `json:"logo_url"`
|
|
LogoHorizontalURL string `json:"logo_horizontal_url"`
|
|
FaviconURL string `json:"favicon_url"`
|
|
}
|
|
|
|
// SetBrandingRequest is the input for set_branding. Both fields are
|
|
// optional; nil leaves the existing value untouched. Asset uploads are
|
|
// deliberately excluded from MCP — admins use the Settings UI for that.
|
|
type SetBrandingRequest struct {
|
|
InstanceName *string `json:"instance_name,omitempty" jsonschema:"New instance display name (replaces \"LocalAI\" in headers, footers, and the browser tab). Pass an empty string to reset to default."`
|
|
InstanceTagline *string `json:"instance_tagline,omitempty" jsonschema:"Optional short subtitle shown beneath the instance name. Pass an empty string to clear."`
|
|
}
|
|
|
|
// UsageStatsQuery is the input for get_usage_stats. UserID is optional;
|
|
// when empty the tool returns the calling user's own usage in auth-on
|
|
// mode, or the synthetic local user's usage in single-user no-auth
|
|
// mode. Admins (or the local user) may pass UserID to inspect another
|
|
// user; the LocalAIClient implementation enforces the role check.
|
|
type UsageStatsQuery struct {
|
|
Period string `json:"period,omitempty" jsonschema:"Time window. One of: day, week, month, all. Defaults to month."`
|
|
UserID string `json:"user_id,omitempty" jsonschema:"Optional user id to query. Empty = caller's own usage. Querying another user requires admin role."`
|
|
All bool `json:"all,omitempty" jsonschema:"When true, returns the cluster-wide /api/usage/all view (admin-only when auth is on)."`
|
|
}
|
|
|
|
// UsageStats is the response shape for get_usage_stats. Mirrors what
|
|
// /api/usage and /api/usage/all return so the LLM can correlate
|
|
// dashboard numbers with what it pulls via MCP.
|
|
type UsageStats struct {
|
|
Viewer UsageViewer `json:"viewer"`
|
|
Period string `json:"period"`
|
|
Totals UsageTotals `json:"totals"`
|
|
Buckets []UsageBucket `json:"buckets"`
|
|
}
|
|
|
|
type UsageViewer struct {
|
|
ID string `json:"id"`
|
|
Name string `json:"name"`
|
|
Role string `json:"role,omitempty"`
|
|
}
|
|
|
|
type UsageTotals struct {
|
|
PromptTokens int64 `json:"prompt_tokens"`
|
|
CompletionTokens int64 `json:"completion_tokens"`
|
|
TotalTokens int64 `json:"total_tokens"`
|
|
RequestCount int64 `json:"request_count"`
|
|
}
|
|
|
|
type UsageBucket struct {
|
|
Bucket string `json:"bucket"`
|
|
Model string `json:"model"`
|
|
UserID string `json:"user_id,omitempty"`
|
|
UserName string `json:"user_name,omitempty"`
|
|
PromptTokens int64 `json:"prompt_tokens"`
|
|
CompletionTokens int64 `json:"completion_tokens"`
|
|
TotalTokens int64 `json:"total_tokens"`
|
|
RequestCount int64 `json:"request_count"`
|
|
}
|
|
|
|
// ---- PII / sensitive data tools ----
|
|
|
|
// PIIPattern is one row in the list_pii_patterns response.
|
|
type PIIPattern struct {
|
|
ID string `json:"id"`
|
|
Description string `json:"description"`
|
|
Action string `json:"action"` // mask | block | allow
|
|
MaxMatchLength int `json:"max_match_length"`
|
|
}
|
|
|
|
// PIIEventsQuery filters get_pii_events.
|
|
type PIIEventsQuery struct {
|
|
CorrelationID string `json:"correlation_id,omitempty" jsonschema:"Optional X-Correlation-ID join key (binds events to the request and usage record)."`
|
|
UserID string `json:"user_id,omitempty" jsonschema:"Optional user id to scope the query."`
|
|
PatternID string `json:"pattern_id,omitempty" jsonschema:"Optional pattern id (e.g. email, ssn)."`
|
|
Limit int `json:"limit,omitempty" jsonschema:"Maximum events. Defaults to 100."`
|
|
}
|
|
|
|
// PIIEvent is the LLM-facing view of one redaction record. The matched
|
|
// value is never exposed; admins audit by hash_prefix.
|
|
type PIIEvent struct {
|
|
ID string `json:"id"`
|
|
CorrelationID string `json:"correlation_id"`
|
|
UserID string `json:"user_id"`
|
|
Direction string `json:"direction"`
|
|
PatternID string `json:"pattern_id"`
|
|
ByteOffset int `json:"byte_offset"`
|
|
Length int `json:"length"`
|
|
HashPrefix string `json:"hash_prefix"`
|
|
Action string `json:"action"`
|
|
CreatedAt string `json:"created_at"`
|
|
}
|
|
|
|
// PIIRedactTestRequest is the input for test_pii_redaction.
|
|
type PIIRedactTestRequest struct {
|
|
Text string `json:"text" jsonschema:"The candidate text. Will be run through the redactor without recording an event."`
|
|
}
|
|
|
|
// PIIRedactTestResult is the output for test_pii_redaction. spans
|
|
// describes where the redactor matched; redacted is the text after
|
|
// applying mask actions; blocked / masked flag what was done.
|
|
type PIIRedactTestResult struct {
|
|
Redacted string `json:"redacted"`
|
|
Spans []PIIEventSpan `json:"spans"`
|
|
Blocked bool `json:"blocked"`
|
|
Masked bool `json:"masked"`
|
|
}
|
|
|
|
type PIIEventSpan struct {
|
|
Start int `json:"start"`
|
|
End int `json:"end"`
|
|
Pattern string `json:"pattern"`
|
|
HashPrefix string `json:"hash_prefix"`
|
|
}
|
|
|
|
// PIIPatternActionUpdate is the input for set_pii_pattern_action.
|
|
// At least one of Action or Disabled must be set. Mutations are
|
|
// transient by default — call persist_pii_patterns to flush them
|
|
// to runtime_settings.json so the next start re-applies them.
|
|
type PIIPatternActionUpdate struct {
|
|
ID string `json:"id" jsonschema:"Pattern id to mutate (e.g. email, ssn, credit_card, api_key_prefix)."`
|
|
Action string `json:"action,omitempty" jsonschema:"New action: mask, block, or allow. Optional — omit to leave the action unchanged."`
|
|
Disabled *bool `json:"disabled,omitempty" jsonschema:"Set true to skip this pattern entirely; false to re-enable. Optional — omit to leave enabled-state unchanged."`
|
|
}
|
|
|
|
// MiddlewareStatus is the aggregated /api/middleware/status payload —
|
|
// the React Middleware page renders this in one go. Routing is a
|
|
// placeholder until subsystem 2 lands.
|
|
type MiddlewareStatus struct {
|
|
PII MiddlewarePIIStatus `json:"pii"`
|
|
Router MiddlewareRouterStatus `json:"router"`
|
|
}
|
|
|
|
// MiddlewarePIIStatus shows what the redactor is doing right now and
|
|
// which models opt in. enabled_globally=false means --disable-pii.
|
|
type MiddlewarePIIStatus struct {
|
|
EnabledGlobally bool `json:"enabled_globally"`
|
|
Reason string `json:"reason,omitempty"`
|
|
DefaultEnabledForBackends []string `json:"default_enabled_for_backends,omitempty"`
|
|
Patterns []PIIPattern `json:"patterns"`
|
|
Models []MiddlewarePIIModel `json:"models"`
|
|
RecentEventCount int `json:"recent_event_count"`
|
|
}
|
|
|
|
// MiddlewarePIIModel is one model row in the per-model PII table.
|
|
type MiddlewarePIIModel struct {
|
|
Name string `json:"name"`
|
|
Backend string `json:"backend"`
|
|
Enabled bool `json:"enabled"`
|
|
Explicit bool `json:"explicit"` // Did YAML set Enabled, or did the backend prefix decide?
|
|
DefaultForBackend bool `json:"default_for_backend"` // Backend matches the auto-on rule (proxy-*).
|
|
Overrides map[string]string `json:"overrides,omitempty"`
|
|
}
|
|
|
|
// MiddlewareRouterStatus is the placeholder shape the Routing tab
|
|
// reads. Subsystem 2 fills in Models with real RouterDecision rows.
|
|
type MiddlewareRouterStatus struct {
|
|
Configured bool `json:"configured"`
|
|
Models []string `json:"models"`
|
|
Note string `json:"note,omitempty"`
|
|
}
|
|
|
|
// RouterDecisionsQuery filters get_router_decisions.
|
|
type RouterDecisionsQuery struct {
|
|
CorrelationID string `json:"correlation_id,omitempty" jsonschema:"Optional X-Correlation-ID join key (binds decisions to the request and usage record)."`
|
|
UserID string `json:"user_id,omitempty" jsonschema:"Optional user id to scope the query."`
|
|
RouterModel string `json:"router_model,omitempty" jsonschema:"Optional router model name to filter by (e.g. smart-router)."`
|
|
Limit int `json:"limit,omitempty" jsonschema:"Maximum decisions. Defaults to 100."`
|
|
}
|
|
|
|
// RouterDecision is the LLM-facing view of one routing decision. The
|
|
// prompt is NEVER stored; admins audit by hash if they need to dedupe
|
|
// recurring routing patterns.
|
|
type RouterDecision struct {
|
|
ID string `json:"id"`
|
|
CorrelationID string `json:"correlation_id"`
|
|
UserID string `json:"user_id"`
|
|
RouterModel string `json:"router_model"`
|
|
RequestedModel string `json:"requested_model"`
|
|
ServedModel string `json:"served_model"`
|
|
Classifier string `json:"classifier"`
|
|
Label string `json:"label"`
|
|
Score float64 `json:"score"`
|
|
LatencyMs int64 `json:"latency_ms"`
|
|
Cached bool `json:"cached"`
|
|
CreatedAt string `json:"created_at"`
|
|
}
|
|
|
|
// VRAMEstimateRequest is the input for vram_estimate. The output type is
|
|
// pkg/vram.EstimateResult — used directly via the LocalAIClient interface
|
|
// so the LLM sees the same shape (size_bytes/size_display/vram_bytes/
|
|
// vram_display) that the REST endpoint returns.
|
|
type VRAMEstimateRequest struct {
|
|
ModelName string `json:"model_name" jsonschema:"Installed model name."`
|
|
ContextSize int `json:"context_size,omitempty" jsonschema:"Context size in tokens."`
|
|
GPULayers int `json:"gpu_layers,omitempty" jsonschema:"Number of layers to offload to GPU. -1 for all."`
|
|
KVQuantBits int `json:"kv_quant_bits,omitempty" jsonschema:"KV cache quantization bits (e.g. 4, 8, 16)."`
|
|
}
|