Mirror of https://github.com/mudler/LocalAI.git (synced 2026-05-17 13:10:23 -04:00)
* fix(http): close 0.0.0.0/[::] SSRF bypass in /api/cors-proxy

  The CORS proxy carried its own private-network blocklist (RFC 1918 + a
  handful of IPv6 ranges) instead of using the same classification as
  pkg/utils/urlfetch.go. The hand-rolled list missed 0.0.0.0/8 and ::/128,
  both of which Linux routes to localhost — so any user with FeatureMCP
  (default-on for new users) could reach LocalAI's own listener and any
  other service bound to 0.0.0.0:port via:

    GET /api/cors-proxy?url=http://0.0.0.0:8080/...
    GET /api/cors-proxy?url=http://[::]:8080/...

  Replace the custom check with utils.IsPublicIP (Go stdlib IsLoopback /
  IsLinkLocalUnicast / IsPrivate / IsUnspecified, plus IPv4-mapped IPv6
  unmasking) and add an upfront hostname rejection for localhost, *.local,
  and the cloud metadata aliases so split-horizon DNS can't paper over the
  IP check. The IP-pinning DialContext is unchanged: the validated IP from
  the single resolution is reused for the connection, so DNS rebinding
  still cannot swap a public answer for a private one between validate and
  dial.

  Regression tests cover 0.0.0.0, 0.0.0.0:PORT, [::], ::ffff:127.0.0.1,
  ::ffff:10.0.0.1, file://, gopher://, ftp://, localhost, 127.0.0.1,
  10.0.0.1, 169.254.169.254, metadata.google.internal.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(downloader): verify SHA before promoting temp file to final path

  DownloadFileWithContext renamed the .partial file to its final name
  *before* checking the streamed SHA, so a hash mismatch returned an error
  but left the tampered file at filePath. Subsequent code that operated on
  filePath (a backend launcher, a YAML loader, a re-download that finds
  the file already present and skips) would consume the attacker-supplied
  bytes.

  Reorder: verify the streamed hash first, remove the .partial on
  mismatch, then rename. The streamed hash is computed during io.Copy so
  no second read is needed. While here, raise the empty-SHA case from a
  Debug log to a Warn so "this download had no integrity check" is visible
  at the default log level. Backend installs currently pass through with
  no digest; the warning makes that footprint observable without changing
  behaviour. A sketch of the reordered tail appears after this group of
  commits.

  Regression test asserts os.IsNotExist on the destination after a
  deliberate SHA mismatch.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(auth): require email_verified for OIDC admin promotion

  extractOIDCUserInfo read the ID token's "email" claim but never
  inspected "email_verified". With LOCALAI_ADMIN_EMAIL set, an attacker
  who could register on the configured OIDC IdP under that email (some
  IdPs accept self-supplied unverified emails) inherited admin role:

  - first login: AssignRole(tx, email, adminEmail) → RoleAdmin
  - re-login: MaybePromote(db, user, adminEmail) → flip to RoleAdmin

  Add EmailVerified to oauthUserInfo, parse email_verified from the OIDC
  claims (default false on absence so an IdP that omits the claim cannot
  short-circuit the gate), and substitute "" for the role-decision email
  when verified=false via emailForRoleDecision. The user record still
  stores the unverified email for display.

  GitHub's path defaults EmailVerified=true: GitHub only returns a public
  profile email after verification, and fetchGitHubPrimaryEmail explicitly
  filters to Verified=true.

  Regression tests cover both the helper contract and integration with
  AssignRole, including the bootstrap "first user" branch that would
  otherwise mask the gate.
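  A minimal sketch of the reordering the downloader fix above describes.
  The helper and parameter names (verifyAndPromote, wantSHA) are
  illustrative, not the actual LocalAI downloader API; the point is the
  ordering: check the digest that was accumulated during io.Copy, discard
  the temp file on mismatch, and only then rename into place.

```go
package downloader

import (
	"encoding/hex"
	"fmt"
	"hash"
	"os"
)

// verifyAndPromote finalises a completed ".partial" download. Because the
// hash was fed during the copy, verification costs no second read, and a
// tampered file never becomes visible at finalPath.
func verifyAndPromote(h hash.Hash, wantSHA, partialPath, finalPath string) error {
	if wantSHA == "" {
		// No digest supplied: proceed, but loudly (mirrors the new Warn log).
		fmt.Fprintln(os.Stderr, "warning: download completed without integrity check")
	} else if got := hex.EncodeToString(h.Sum(nil)); got != wantSHA {
		_ = os.Remove(partialPath) // drop the tampered bytes before anyone can read them
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA)
	}
	return os.Rename(partialPath, finalPath)
}
```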
  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(cli): refuse public bind when no auth backend is configured

  When neither an auth DB nor a static API key is set, the auth middleware
  passes every request through. That is fine for a developer laptop, a
  home LAN, or a Tailnet — the network itself is the trust boundary. It is
  not fine on a public IP, where every model install, settings change, and
  admin endpoint becomes reachable from the internet. Refuse to start in
  that exact configuration.

  Loopback, RFC 1918, RFC 4193 ULA, link-local, and RFC 6598 CGNAT
  (Tailscale's default range) all count as trusted; wildcard binds
  (`:port`, `0.0.0.0`, `[::]`) are accepted only when every host interface
  is in one of those ranges. Hostnames are resolved and treated as trusted
  only when every answer is. A sketch of the address classification
  appears after this group of commits.

  A new --allow-insecure-public-bind / LOCALAI_ALLOW_INSECURE_PUBLIC_BIND
  flag opts out for deployments that gate access externally (a reverse
  proxy enforcing auth, a mesh ACL, etc.). The error message lists this
  plus the three constructive alternatives (bind a private interface,
  enable --auth, set --api-keys).

  The interface enumeration goes through a package-level interfaceAddrsFn
  var so tests can simulate cloud-VM, home-LAN, Tailscale-only, and
  enumeration-failure topologies without poking at the real network stack.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): regression-test the localai_assistant admin gate

  ChatEndpoint already rejects metadata.localai_assistant=true from a
  non-admin caller, but the gate was open-coded inline with no direct test
  coverage. The chat route is FeatureChat-gated (default-on), and the
  assistant's in-process MCP server can install/delete models and edit
  configs — the wrong handler change would silently turn the LLM into a
  confused deputy.

  Extract the gate into requireAssistantAccess(c, authEnabled) and pin its
  behaviour: auth disabled is a no-op, unauthenticated is 403, RoleUser is
  403, RoleAdmin and the synthetic legacy-key admin are admitted. No
  behaviour change in the production path.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): assert every API route is auth-classified

  The auth middleware classifies path prefixes (/api/, /v1/, /models/,
  etc.) as protected and treats anything else as a static-asset
  passthrough. A new endpoint shipped under a brand-new prefix — or a new
  path that simply isn't on the prefix allowlist — would be reachable
  anonymously.

  Walk every route registered by API() with auth enabled and a fresh
  in-memory database (no users, no keys), and assert each API-prefixed
  route returns 401 / 404 / 405 to an anonymous request. Public surfaces
  (/api/auth/*, /api/branding, /api/node/* token-authenticated routes,
  /healthz, branding asset server, generated-content server, static
  assets) are explicit allowlist entries with comments justifying them.

  Build-tagged 'auth' so it runs against the SQLite-backed auth DB
  (matches the existing auth suite).

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): pin agent endpoint per-user isolation contract

  agents.go's getUserID / effectiveUserID / canImpersonateUser /
  wantsAllUsers helpers are the single trust boundary for cross-user
  access on agent, agent-jobs, collections, and skills routes. A
  regression there is the difference between "regular user reads their own
  data" and "regular user reads anyone's data via ?user_id=victim".

  Lock in the contract:

  - effectiveUserID ignores ?user_id= for unauthenticated and RoleUser
  - effectiveUserID honours it for RoleAdmin and ProviderAgentWorker
  - wantsAllUsers requires admin AND the literal "true" string
  - canImpersonateUser is admin OR agent-worker, never plain RoleUser

  No production change — this commit only adds tests.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>
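  A minimal sketch of the trust classification behind the public-bind
  refusal above; helper names are illustrative, not the LocalAI CLI code.
  RFC 6598 CGNAT (Tailscale's default range) has no stdlib predicate, so
  it is matched explicitly; the addrs slice would come from
  net.InterfaceAddrs() (or the test-injectable interfaceAddrsFn) in
  practice.

```go
package bindcheck

import "net"

var cgnat = mustCIDR("100.64.0.0/10") // RFC 6598

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// isTrusted reports whether an address is loopback, RFC 1918 / RFC 4193
// private, link-local, or CGNAT; that is, not reachable from the open
// internet. net.IP.IsPrivate covers both RFC 1918 and RFC 4193 ULA.
func isTrusted(ip net.IP) bool {
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || cgnat.Contains(ip)
}

// wildcardBindTrusted reports whether a wildcard bind (":port", "0.0.0.0",
// "[::]") is acceptable: every host interface address must itself be trusted.
func wildcardBindTrusted(addrs []net.Addr) bool {
	for _, a := range addrs {
		ipn, ok := a.(*net.IPNet)
		if !ok || !isTrusted(ipn.IP) {
			return false
		}
	}
	return true
}
```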
* fix(downloader): drop redundant stat in removePartialFile

  The stat-then-remove pattern is a TOCTOU window and a wasted syscall —
  os.Remove already returns ErrNotExist for the missing-file case, so
  trust that and treat it as a no-op. A sketch follows this group of
  commits.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): redact secrets from trace buffer and distribution-token logs

  The /api/traces buffer captured Authorization, Cookie, Set-Cookie, and
  API-key headers verbatim from every request when tracing was enabled.
  The endpoint is admin-only but the buffer is reachable via any
  heap-style introspection and the captured tokens otherwise outlive the
  request. Strip those header values at capture time. Body redaction is
  left to a follow-up — the prompts are usually the operator's own and
  JSON-walking is invasive.

  Distribution tokens were also logged in plaintext from
  core/explorer/discovery.go; logs forward to syslog/journald and outlive
  the token. Redact those to a short prefix/suffix instead.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): rate-limit OAuth callbacks separately from password endpoints

  The shared 5/min/IP limit on auth endpoints is right for password-style
  flows but too tight for OAuth callbacks: corporate SSO funnels many real
  users through one outbound IP and would trip the limit. Add a separate
  60/min/IP limiter for /api/auth/{github,oidc}/callback so callbacks are
  bounded against floods without breaking shared-IP deployments.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(gallery): verify backend tarball sha256 when set in gallery entry

  GalleryBackend gained an optional sha256 field; the install path now
  threads it through to the existing downloader hash-verify (which already
  streams, verifies, and rolls back on mismatch). Galleries without sha256
  keep working; the empty-SHA path still emits the existing "downloading
  without integrity check" warning.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): pin CSRF coverage on multipart endpoints

  The CSRF middleware in app.go is global (e.Use) so it covers every
  multipart upload route — branding assets, fine-tune datasets, audio
  transforms, agent collections. Pin that contract: cross-site multipart
  POSTs are rejected; same-origin / same-site / API-key clients are not.
  Also pins the SameSite=Lax fallback path the skipper relies on when
  Sec-Fetch-Site is absent.
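  The removePartialFile fix above is small enough to sketch whole;
  removePartial is an illustrative name, not the LocalAI helper. os.Remove
  already reports the missing-file case, so the prior os.Stat only added a
  TOCTOU window and a syscall.

```go
package downloader

import (
	"errors"
	"io/fs"
	"os"
)

// removePartial deletes a leftover .partial file if present.
func removePartial(path string) error {
	if err := os.Remove(path); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return err
	}
	return nil // a file that is already gone is a no-op, not an error
}
```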
  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(http): XSS hardening — CSP headers, safe href, base-href escape, SVG sandbox

  Several closely related XSS-prevention changes spanning the SPA shell,
  the React UI, and the branding asset server:

  - New SecurityHeaders middleware sets CSP, X-Content-Type-Options,
    X-Frame-Options, and Referrer-Policy on every response. The CSP keeps
    script-src permissive because the Vite bundle relies on inline +
    eval'd scripts; tightening that requires moving to a nonce-based
    policy.
  - The <base href> injection in the SPA shell escaped
    attacker-controllable Host / X-Forwarded-Host headers — a single quote
    in the host header broke out of the attribute. Pass through
    SecureBaseHref (html.EscapeString).
  - Three React sinks rendering untrusted content via
    dangerouslySetInnerHTML switch to text-node rendering with whiteSpace:
    pre-wrap: user message bodies in Chat.jsx and AgentChat.jsx, and the
    agent activity log in AgentChat.jsx. The hand-rolled escape on the
    agent user-message variant is replaced by the same plain-text path.
  - New safeHref util collapses non-allowlisted URI schemes (most
    importantly javascript:) to '#'. Applied to gallery `<a href={url}>`
    links in Models / Backends / Manage and to canvas artifact links —
    these come from gallery JSON or assistant tool calls and must be
    treated as untrusted.
  - The branding asset server attaches a sandbox CSP plus same-origin CORP
    to .svg responses. The React UI loads logos via <img>, but the same
    URL is also reachable via direct navigation; this prevents script
    execution if a hostile SVG slipped past upload validation.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(http): bound HTTP server with read-header and idle timeouts

  A net/http server with no timeouts is trivially Slowloris-able and leaks
  idle keep-alive connections. Set ReadHeaderTimeout (30s) to plug the
  slow-headers attack and IdleTimeout (120s) to cap keep-alive sockets.
  ReadTimeout and WriteTimeout stay at 0 because request bodies can be
  multi-GB model uploads and SSE / chat completions stream for many
  minutes; operators who need tighter per-request bounds should terminate
  slow clients at a reverse proxy. A sketch of the configuration follows
  this group of commits.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(auth): pin PUT /api/auth/profile field-tampering contract

  The handler uses an explicit local body struct (only name and
  avatar_url) plus a gorm Updates(map) with a column allowlist, so an
  attacker posting {"role":"admin","email":"...","password_hash":"..."}
  can't mass-assign those fields. Lock that down with a regression test so
  a future "let's just c.Bind(&user)" refactor breaks loudly.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(services): strip directory components from multipart upload filenames

  UploadDataset and UploadToCollectionForUser took the raw multipart
  file.Filename and joined it into a destination path. The fine-tune
  upload was incidentally safe because of a UUID prefix that fused any
  leading '..' to a literal segment, but the protection is fragile.
  UploadToCollectionForUser handed the filename to a vendored backend
  without sanitising at all. Strip to filepath.Base at both boundaries and
  reject the trivial unsafe values ("", ".", "..", "/").
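  A sketch of the server bounds described in the timeout commit above,
  using only net/http fields; newServer, addr, and handler are
  placeholders, while the values are the ones the commit states.

```go
package server

import (
	"net/http"
	"time"
)

func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 30 * time.Second,  // bounds slow-header (Slowloris) clients
		IdleTimeout:       120 * time.Second, // reaps idle keep-alive sockets
		// ReadTimeout and WriteTimeout stay 0 on purpose: multi-GB model
		// uploads and long-lived SSE streams must not be cut off mid-flight.
	}
}
```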
  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): validate persisted MCP server entries on load

  localStorage is shared across same-origin pages; an XSS that lands once
  can poison persisted MCP server config to attempt header injection or to
  feed a non-http URL into the fetch path on subsequent loads. Validate
  every entry: types must match, URL must parse with http(s) scheme,
  header keys/values must be control-char-free. Drop anything that doesn't
  fit.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): close X-Forwarded-Prefix open redirect

  The reverse-proxy support concatenated X-Forwarded-Prefix into the
  redirect target without validation, so a forged header value of
  "//evil.com" turned the SPA-shell redirect helper at /, /browse, and
  /browse/* into a 301 to //evil.com/app. The path-strip middleware had
  the same shape on its prefix-trailing-slash redirect.

  Add SafeForwardedPrefix at the middleware boundary: must start with a
  single '/', no protocol-relative '//' opener, no scheme, no backslash,
  no control characters. Apply at both consumers; misconfig trips the
  validator and the header is dropped. A sketch of the checks follows this
  group of commits.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): refuse wildcard CORS when LOCALAI_CORS=true with empty allowlist

  When LOCALAI_CORS=true but LOCALAI_CORS_ALLOW_ORIGINS was empty, Echo's
  CORSWithConfig saw an empty allow-list and fell back to its default
  AllowOrigins=["*"]. An operator who flipped the strict-CORS feature flag
  without populating the list got the opposite of what they asked for.
  Echo never sets Allow-Credentials: true so this isn't directly
  exploitable (cookies aren't sent under wildcard CORS), but the
  misconfiguration trap is worth closing. Skip the registration and warn.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): zxcvbn password strength check with user-acknowledged override

  The previous policy was len < 8, which let through "Password1" and the
  rest of the credential-stuffing corpus. LocalAI has no second factor
  yet, so the bar needs to sit higher.

  Add ValidatePasswordStrength using github.com/timbutler/zxcvbn (an
  actively-maintained fork of the trustelem port; v1.0.4, April 2024):

  - min 12 chars, max 72 (bcrypt's truncation point)
  - reject NUL bytes (some bcrypt callers truncate at the first NUL)
  - require zxcvbn score >= 3 ("safely unguessable, ~10^8 guesses to
    break"); the hint list ["localai", "local-ai", "admin"] penalises
    passwords built from the app's own branding

  zxcvbn produces false positives sometimes (a strong-looking password
  that happens to match a dictionary word) and operators occasionally need
  to set a known-weak password (kiosk demos, CI rigs). Add an
  acknowledgement path: PasswordPolicy{AllowWeak: true} skips the entropy
  check while still enforcing the hard rules. The structured
  PasswordErrorResponse marks weak-password rejections as Overridable so
  the UI can surface a "use this anyway" checkbox.

  Wired through register, self-service password change, and admin password
  reset on both the server and the React UI.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): drop HTML5 minLength on new-password inputs

  minLength={12} on the new-password input let the browser block the form
  submit silently before any JS or network call ran. The browser focused
  the field, showed a brief native tooltip, and that was that — no toast,
  no fetch, no clue. Reproducible by typing fewer than 12 chars on the
  second password change of a session.

  The JS-level length check in handleSubmit already shows a toast and the
  server rejects with a structured error, so the HTML5 attribute was
  redundant defence anyway. Drop it.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>
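  A sketch of the validation rules the X-Forwarded-Prefix fix above lists.
  The real SafeForwardedPrefix lives at the middleware boundary; this
  stand-alone version only illustrates the checks, and rejecting every ':'
  is a deliberately conservative reading of "no scheme".

```go
package middleware

import "strings"

// safeForwardedPrefix returns the prefix and true only when it is a plain
// absolute path: starts with a single '/', no "//" opener, no scheme
// separator or backslash, no control characters.
func safeForwardedPrefix(p string) (string, bool) {
	if p == "" || p[0] != '/' {
		return "", false // must start with a single '/'
	}
	if strings.HasPrefix(p, "//") {
		return "", false // protocol-relative opener: "//evil.com"
	}
	if strings.ContainsAny(p, "\\:") {
		return "", false // no backslash, no scheme separator
	}
	for _, r := range p {
		if r < 0x20 || r == 0x7f {
			return "", false // no control characters
		}
	}
	return p, true
}
```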
* fix(react-ui): bundle Geist fonts locally instead of fetching from Google

  The new CSP correctly refused to apply styles from fonts.googleapis.com
  because style-src is locked to 'self' and 'unsafe-inline'. Loosening the
  CSP would defeat its purpose; the right fix is to stop reaching out to a
  third-party CDN for fonts on every page load.

  Add @fontsource-variable/geist and @fontsource-variable/geist-mono as
  npm deps and import them once at boot. Drop the <link rel="preconnect">
  and external stylesheet from index.html.

  Side benefit: no third-party tracking via Referer / IP on every UI load,
  no failure mode when offline / behind a captive portal.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): refresh i18n strings to reflect 12-char password minimum

  The translations still said "at least 8 characters" everywhere — the
  client-side toast on a too-short password change told the user the wrong
  floor. Update tooShort and newPasswordPlaceholder /
  newPasswordDescription across all five locales (en, es, it, de, zh-CN)
  to match the real ValidatePasswordStrength rule.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): make password length-floor overridable like the entropy check

  The 12-char minimum was a policy choice, not a technical invariant —
  only "non-empty", "<= 72 bytes", and "no NUL bytes" are real bcrypt
  constraints. Treating length-12 as a hard rule was inconsistent with the
  entropy check (already overridable) and friction for use cases where the
  account is just a name on a session, not a security boundary
  (single-user kiosk, CI rig, lab demo).

  Restructure ValidatePasswordStrength (see the sketch below):

  - Hard rules (always enforced): non-empty, <= MaxPasswordLength, no NUL byte
  - Policy rules (skipped when AllowWeak=true): length >= 12, zxcvbn score >= 3

  PasswordError now marks password_too_short as Overridable too. The React
  forms generalised from `error_code === 'password_too_weak'` to
  `overridable === true`, and the JS-side preflight length checks were
  removed (server is source of truth, returns the same checkbox flow).

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
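  The shape of the hard-rule / policy-rule split the final commit
  describes, as a hedged sketch. scoreFn abstracts the zxcvbn scoring so
  this does not pin any particular library's API; the error strings echo
  the structured error codes mentioned above, while the real responses
  additionally carry an Overridable flag.

```go
package auth

import (
	"errors"
	"strings"
)

type PasswordPolicy struct {
	AllowWeak bool // user-acknowledged override: skips policy rules only
}

func validatePassword(pw string, pol PasswordPolicy, scoreFn func(string) int) error {
	// Hard rules: real bcrypt constraints, never overridable.
	switch {
	case pw == "":
		return errors.New("password must not be empty")
	case len(pw) > 72:
		return errors.New("password exceeds bcrypt's 72-byte truncation point")
	case strings.ContainsRune(pw, 0):
		return errors.New("password must not contain NUL bytes")
	}
	if pol.AllowWeak {
		return nil
	}
	// Policy rules: rejected with an overridable error the UI can surface
	// as a "use this anyway" checkbox.
	if len(pw) < 12 {
		return errors.New("password_too_short")
	}
	if scoreFn(pw) < 3 {
		return errors.New("password_too_weak")
	}
	return nil
}
```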
1317 lines · 46 KiB · Go
package openai

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	"github.com/google/uuid"
	"github.com/labstack/echo/v4"
	"github.com/mudler/LocalAI/core/backend"
	"github.com/mudler/LocalAI/core/config"
	mcpTools "github.com/mudler/LocalAI/core/http/endpoints/mcp"
	"github.com/mudler/LocalAI/core/http/middleware"
	"github.com/mudler/LocalAI/core/schema"
	"github.com/mudler/LocalAI/pkg/functions"
	reason "github.com/mudler/LocalAI/pkg/reasoning"

	"github.com/mudler/LocalAI/core/templates"
	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
	"github.com/mudler/LocalAI/pkg/model"

	"github.com/mudler/xlog"
)

// hasSystemMessage reports whether the message slice already contains a
// system-role message — used to avoid clobbering a caller-supplied system
// prompt when the LocalAI Assistant modality is on.
func hasSystemMessage(messages []schema.Message) bool {
	for _, m := range messages {
		if m.Role == "system" {
			return true
		}
	}
	return false
}

// mergeToolCallDeltas merges streaming tool call deltas into complete tool calls.
// In SSE streaming, a single tool call arrives as multiple chunks sharing the same Index:
// the first chunk carries the ID, Type, and Name; subsequent chunks append to Arguments.
func mergeToolCallDeltas(existing []schema.ToolCall, deltas []schema.ToolCall) []schema.ToolCall {
	byIndex := make(map[int]int, len(existing)) // tool call Index -> position in slice
	for i, tc := range existing {
		byIndex[tc.Index] = i
	}
	for _, d := range deltas {
		pos, found := byIndex[d.Index]
		if !found {
			byIndex[d.Index] = len(existing)
			existing = append(existing, d)
			continue
		}
		// Merge into existing entry
		tc := &existing[pos]
		if d.ID != "" {
			tc.ID = d.ID
		}
		if d.Type != "" {
			tc.Type = d.Type
		}
		if d.FunctionCall.Name != "" {
			tc.FunctionCall.Name = d.FunctionCall.Name
		}
		tc.FunctionCall.Arguments += d.FunctionCall.Arguments
	}
	return existing
}

// ChatEndpoint is the OpenAI Chat Completion API endpoint https://platform.openai.com/docs/api-reference/chat/create
// @Summary Generate a chat completion for a given prompt and model.
// @Tags inference
// @Param request body schema.OpenAIRequest true "query params"
// @Success 200 {object} schema.OpenAIResponse "Response"
// @Router /v1/chat/completions [post]
func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator *templates.Evaluator, startupOptions *config.ApplicationConfig, natsClient mcpTools.MCPNATSClient, assistantHolder *mcpTools.LocalAIAssistantHolder) echo.HandlerFunc {
	process := func(s string, req *schema.OpenAIRequest, config *config.ModelConfig, loader *model.ModelLoader, responses chan schema.OpenAIResponse, extraUsage bool, id string, created int) error {
		initialMessage := schema.OpenAIResponse{
			ID:      id,
			Created: created,
			Model:   req.Model, // we have to return what the user sent here, due to OpenAI spec.
			Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0, FinishReason: nil}},
			Object:  "chat.completion.chunk",
		}
		responses <- initialMessage

		// Detect if thinking token is already in prompt or template.
		// When UseTokenizerTemplate is enabled, predInput is empty, so we check the template.
		var template string
		if config.TemplateConfig.UseTokenizerTemplate {
			template = config.GetModelTemplate()
		} else {
			template = s
		}
		thinkingStartToken := reason.DetectThinkingStartToken(template, &config.ReasoningConfig)
		extractor := reason.NewReasoningExtractor(thinkingStartToken, config.ReasoningConfig)

		_, _, _, err := ComputeChoices(req, s, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, tokenUsage backend.TokenUsage) bool {
			var reasoningDelta, contentDelta string

			// Always keep the Go-side extractor in sync with raw tokens so it
			// can serve as fallback for backends without an autoparser (e.g. vLLM).
			goReasoning, goContent := extractor.ProcessToken(s)

			// When C++ autoparser chat deltas are available, prefer them — they
			// handle model-specific formats (Gemma 4, etc.) without Go-side tags.
			// Otherwise fall back to Go-side extraction.
			if tokenUsage.HasChatDeltaContent() {
				rawReasoning, cd := tokenUsage.ChatDeltaReasoningAndContent()
				contentDelta = cd
				reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
			} else {
				reasoningDelta = goReasoning
				contentDelta = goContent
			}

			usage := schema.OpenAIUsage{
				PromptTokens:     tokenUsage.Prompt,
				CompletionTokens: tokenUsage.Completion,
				TotalTokens:      tokenUsage.Prompt + tokenUsage.Completion,
			}
			if extraUsage {
				usage.TimingTokenGeneration = tokenUsage.TimingTokenGeneration
				usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
			}

			delta := &schema.Message{}
			if contentDelta != "" {
				delta.Content = &contentDelta
			}
			if reasoningDelta != "" {
				delta.Reasoning = &reasoningDelta
			}

			resp := schema.OpenAIResponse{
				ID:      id,
				Created: created,
				Model:   req.Model, // we have to return what the user sent here, due to OpenAI spec.
				Choices: []schema.Choice{{Delta: delta, Index: 0, FinishReason: nil}},
				Object:  "chat.completion.chunk",
				Usage:   usage,
			}

			responses <- resp
			return true
		})
		close(responses)
		return err
	}
	processTools := func(noAction string, prompt string, req *schema.OpenAIRequest, config *config.ModelConfig, loader *model.ModelLoader, responses chan schema.OpenAIResponse, extraUsage bool, id string, created int, textContentToReturn *string) error {
		// Detect if thinking token is already in prompt or template
		var template string
		if config.TemplateConfig.UseTokenizerTemplate {
			template = config.GetModelTemplate()
		} else {
			template = prompt
		}
		thinkingStartToken := reason.DetectThinkingStartToken(template, &config.ReasoningConfig)
		extractor := reason.NewReasoningExtractor(thinkingStartToken, config.ReasoningConfig)

		result := ""
		lastEmittedCount := 0
		sentInitialRole := false
		sentReasoning := false
		hasChatDeltaToolCalls := false
		hasChatDeltaContent := false

		_, tokenUsage, chatDeltas, err := ComputeChoices(req, prompt, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
			result += s

			// Track whether ChatDeltas from the C++ autoparser contain
			// tool calls or content, so the retry decision can account for them.
			for _, d := range usage.ChatDeltas {
				if len(d.ToolCalls) > 0 {
					hasChatDeltaToolCalls = true
				}
				if d.Content != "" {
					hasChatDeltaContent = true
				}
			}

			var reasoningDelta, contentDelta string

			goReasoning, goContent := extractor.ProcessToken(s)

			if usage.HasChatDeltaContent() {
				rawReasoning, cd := usage.ChatDeltaReasoningAndContent()
				contentDelta = cd
				reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
			} else {
				reasoningDelta = goReasoning
				contentDelta = goContent
			}

			// Emit reasoning deltas in their own SSE chunks before any tool-call chunks
			// (OpenAI spec: reasoning and tool_calls never share a delta)
			if reasoningDelta != "" {
				responses <- schema.OpenAIResponse{
					ID:      id,
					Created: created,
					Model:   req.Model,
					Choices: []schema.Choice{{
						Delta: &schema.Message{Reasoning: &reasoningDelta},
						Index: 0,
					}},
					Object: "chat.completion.chunk",
				}
				sentReasoning = true
			}

			// Stream content deltas (cleaned of reasoning tags) while no tool calls
			// have been detected. Once the incremental parser finds tool calls,
			// content stops — per OpenAI spec, content and tool_calls don't mix.
			if lastEmittedCount == 0 && contentDelta != "" {
				if !sentInitialRole {
					responses <- schema.OpenAIResponse{
						ID: id, Created: created, Model: req.Model,
						Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0}},
						Object:  "chat.completion.chunk",
					}
					sentInitialRole = true
				}
				responses <- schema.OpenAIResponse{
					ID: id, Created: created, Model: req.Model,
					Choices: []schema.Choice{{
						Delta: &schema.Message{Content: &contentDelta},
						Index: 0,
					}},
					Object: "chat.completion.chunk",
				}
			}

			// Try incremental XML parsing for streaming support using iterative parser
			// This allows emitting partial tool calls as they're being generated
			cleanedResult := functions.CleanupLLMResult(result, config.FunctionsConfig)

			// Determine XML format from config
			var xmlFormat *functions.XMLToolCallFormat
			if config.FunctionsConfig.XMLFormat != nil {
				xmlFormat = config.FunctionsConfig.XMLFormat
			} else if config.FunctionsConfig.XMLFormatPreset != "" {
				xmlFormat = functions.GetXMLFormatPreset(config.FunctionsConfig.XMLFormatPreset)
			}

			// Use iterative parser for streaming (partial parsing enabled)
			// Try XML parsing first
			partialResults, parseErr := functions.ParseXMLIterative(cleanedResult, xmlFormat, true)
			if parseErr == nil && len(partialResults) > 0 {
				// Emit new XML tool calls that weren't emitted before
				if len(partialResults) > lastEmittedCount {
					for i := lastEmittedCount; i < len(partialResults); i++ {
						toolCall := partialResults[i]
						initialMessage := schema.OpenAIResponse{
							ID:      id,
							Created: created,
							Model:   req.Model,
							Choices: []schema.Choice{{
								Delta: &schema.Message{
									Role: "assistant",
									ToolCalls: []schema.ToolCall{
										{
											Index: i,
											ID:    id,
											Type:  "function",
											FunctionCall: schema.FunctionCall{
												Name: toolCall.Name,
											},
										},
									},
								},
								Index:        0,
								FinishReason: nil,
							}},
							Object: "chat.completion.chunk",
						}
						// Non-blocking send: drop the chunk if the consumer isn't ready,
						// rather than stalling the token callback.
						select {
						case responses <- initialMessage:
						default:
						}
					}
					lastEmittedCount = len(partialResults)
				}
			} else {
				// Try JSON tool call parsing for streaming.
				// Only emit NEW tool calls (same guard as XML parser above).
				jsonResults, jsonErr := functions.ParseJSONIterative(cleanedResult, true)
				if jsonErr == nil && len(jsonResults) > lastEmittedCount {
					for i := lastEmittedCount; i < len(jsonResults); i++ {
						jsonObj := jsonResults[i]
						name, ok := jsonObj["name"].(string)
						if !ok || name == "" {
							continue
						}
						args := "{}"
						if argsVal, ok := jsonObj["arguments"]; ok {
							if argsStr, ok := argsVal.(string); ok {
								args = argsStr
							} else {
								argsBytes, _ := json.Marshal(argsVal)
								args = string(argsBytes)
							}
						}
						initialMessage := schema.OpenAIResponse{
							ID:      id,
							Created: created,
							Model:   req.Model,
							Choices: []schema.Choice{{
								Delta: &schema.Message{
									Role: "assistant",
									ToolCalls: []schema.ToolCall{
										{
											Index: i,
											ID:    id,
											Type:  "function",
											FunctionCall: schema.FunctionCall{
												Name:      name,
												Arguments: args,
											},
										},
									},
								},
								Index:        0,
								FinishReason: nil,
							}},
							Object: "chat.completion.chunk",
						}
						responses <- initialMessage
					}
					lastEmittedCount = len(jsonResults)
				}
			}
			return true
		},
			func(attempt int) bool {
				// After streaming completes: check if we got actionable content
				cleaned := extractor.CleanedContent()
				// Check for tool calls from chat deltas (will be re-checked after ComputeChoices,
				// but we need to know here whether to retry).
				// Also check ChatDelta flags — when the C++ autoparser is active,
				// tool calls and content are delivered via ChatDeltas while the
				// raw message is cleared. Without this check, we'd retry
				// unnecessarily, losing valid results and concatenating output.
				hasToolCalls := lastEmittedCount > 0 || hasChatDeltaToolCalls
				hasContent := cleaned != "" || hasChatDeltaContent
				if !hasContent && !hasToolCalls {
					xlog.Warn("Streaming: backend produced only reasoning, retrying",
						"reasoning_len", len(extractor.Reasoning()), "attempt", attempt+1)
					extractor.ResetAndSuppressReasoning()
					result = ""
					lastEmittedCount = 0
					sentInitialRole = false
					hasChatDeltaToolCalls = false
					hasChatDeltaContent = false
					return true
				}
				return false
			},
		)
		if err != nil {
			return err
		}
		// Try using pre-parsed tool calls from C++ autoparser (chat deltas)
		var functionResults []functions.FuncCallResults
		var reasoning string

		if deltaToolCalls := functions.ToolCallsFromChatDeltas(chatDeltas); len(deltaToolCalls) > 0 {
			xlog.Debug("[ChatDeltas] Using pre-parsed tool calls from C++ autoparser", "count", len(deltaToolCalls))
			functionResults = deltaToolCalls
			// Use content/reasoning from deltas too
			*textContentToReturn = functions.ContentFromChatDeltas(chatDeltas)
			reasoning = functions.ReasoningFromChatDeltas(chatDeltas)
		} else {
			// Fallback: parse tool calls from raw text (no chat deltas from backend)
			xlog.Debug("[ChatDeltas] no pre-parsed tool calls, falling back to Go-side text parsing")
			reasoning = extractor.Reasoning()
			cleanedResult := extractor.CleanedContent()
			*textContentToReturn = functions.ParseTextContent(cleanedResult, config.FunctionsConfig)
			cleanedResult = functions.CleanupLLMResult(cleanedResult, config.FunctionsConfig)
			functionResults = functions.ParseFunctionCall(cleanedResult, config.FunctionsConfig)
		}
		xlog.Debug("[ChatDeltas] final tool call decision", "tool_calls", len(functionResults), "text_content", *textContentToReturn)
		// noAction is a sentinel "just answer" pseudo-function — not a real
		// tool call. Scan the whole slice rather than only index 0 so we
		// don't drop a real tool call that happens to follow a noAction
		// entry, and so the default branch isn't entered with only noAction
		// entries to emit as tool_calls.
		noActionToRun := !hasRealCall(functionResults, noAction)

		switch {
		case noActionToRun:
			usage := schema.OpenAIUsage{
				PromptTokens:     tokenUsage.Prompt,
				CompletionTokens: tokenUsage.Completion,
				TotalTokens:      tokenUsage.Prompt + tokenUsage.Completion,
			}
			if extraUsage {
				usage.TimingTokenGeneration = tokenUsage.TimingTokenGeneration
				usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
			}

			var result string
			if !sentInitialRole {
				var hqErr error
				result, hqErr = handleQuestion(config, functionResults, extractor.CleanedContent(), prompt)
				if hqErr != nil {
					xlog.Error("error handling question", "error", hqErr)
					return hqErr
				}
			}
			for _, chunk := range buildNoActionFinalChunks(
				id, req.Model, created,
				sentInitialRole, sentReasoning,
				result, reasoning, usage,
			) {
				responses <- chunk
			}

		default:
			for _, chunk := range buildDeferredToolCallChunks(
				id, req.Model, created,
				functionResults, lastEmittedCount,
				sentInitialRole, *textContentToReturn,
				sentReasoning, reasoning,
			) {
				responses <- chunk
			}
		}

		close(responses)
		return err
	}

	return func(c echo.Context) error {
		var textContentToReturn string
		id := uuid.New().String()
		created := int(time.Now().Unix())

		input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.OpenAIRequest)
		if !ok || input.Model == "" {
			return echo.ErrBadRequest
		}

		extraUsage := c.Request().Header.Get("Extra-Usage") != ""

		config, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
		if !ok || config == nil {
			return echo.ErrBadRequest
		}

		xlog.Debug("Chat endpoint configuration read", "config", config)

		funcs := input.Functions
		shouldUseFn := len(input.Functions) > 0 && config.ShouldUseFunctions()
		strictMode := false

		// MCP tool injection: when mcp_servers is set in metadata and model has MCP config
		var mcpExecutor mcpTools.ToolExecutor
		mcpServers := mcpTools.MCPServersFromMetadata(input.Metadata)

		// LocalAI Assistant modality: an admin opted into the in-process MCP
		// admin tool surface. Runs *before* the regular MCP block — when both
		// are set, the assistant tools win (the admin cannot mix them with
		// per-model MCP servers in the same chat session by design).
		assistantMode := mcpTools.LocalAIAssistantFromMetadata(input.Metadata)
		if assistantMode {
			if err := requireAssistantAccess(c, startupOptions.Auth.Enabled); err != nil {
				return err
			}
			// Read the disable flag live: an admin can flip it via /api/settings
			// and the next request must see the change without a restart.
			if startupOptions.DisableLocalAIAssistant {
				return echo.NewHTTPError(http.StatusServiceUnavailable, "LocalAI Assistant is disabled on this server")
			}
			if assistantHolder == nil || !assistantHolder.HasTools() {
				return echo.NewHTTPError(http.StatusServiceUnavailable, "LocalAI Assistant is not available on this server")
			}
			mcpExecutor = assistantHolder.Executor()
			mcpFuncs, discErr := mcpExecutor.DiscoverTools(c.Request().Context())
			if discErr != nil {
				xlog.Error("Failed to discover LocalAI Assistant tools", "error", discErr)
				return echo.NewHTTPError(http.StatusInternalServerError, "discover assistant tools: "+discErr.Error())
			}
			for _, fn := range mcpFuncs {
				funcs = append(funcs, fn)
				input.Tools = append(input.Tools, functions.Tool{Type: "function", Function: fn})
			}
			shouldUseFn = len(funcs) > 0 && config.ShouldUseFunctions()

			// Prepend the embedded system prompt unless the caller supplied
			// their own system message. Why: the prompt is what teaches the
			// model the safety rules and recipes. If a caller already has a
			// system message they're responsible for keeping the assistant
			// safe, so we leave it alone.
			if !hasSystemMessage(input.Messages) {
				input.Messages = append([]schema.Message{{Role: "system", StringContent: assistantHolder.SystemPrompt()}}, input.Messages...)
			}

			xlog.Debug("LocalAI Assistant tools injected", "count", len(mcpFuncs))
		}

		// MCP prompt and resource injection (extracted before tool injection)
		mcpPromptName, mcpPromptArgs := mcpTools.MCPPromptFromMetadata(input.Metadata)
		mcpResourceURIs := mcpTools.MCPResourcesFromMetadata(input.Metadata)

		if (len(mcpServers) > 0 || mcpPromptName != "" || len(mcpResourceURIs) > 0) && (config.MCP.Servers != "" || config.MCP.Stdio != "") {
			remote, stdio, mcpErr := config.MCP.MCPConfigFromYAML()
			if mcpErr == nil {
				mcpExecutor = mcpTools.NewToolExecutor(c.Request().Context(), natsClient, config.Name, remote, stdio, mcpServers)

				// Prompt and resource injection (pre-processing step — resolves locally regardless of distributed mode)
				namedSessions, sessErr := mcpTools.NamedSessionsFromMCPConfig(config.Name, remote, stdio, mcpServers)
				if sessErr == nil && len(namedSessions) > 0 {
					mcpCtx, _ := mcpTools.InjectMCPContext(c.Request().Context(), namedSessions, mcpPromptName, mcpPromptArgs, mcpResourceURIs)
					if mcpCtx != nil {
						input.Messages = append(mcpCtx.PromptMessages, input.Messages...)
						mcpTools.AppendResourceSuffix(input.Messages, mcpCtx.ResourceSuffix)
					}
				}

				// Tool injection via executor
				if mcpExecutor.HasTools() {
					mcpFuncs, discErr := mcpExecutor.DiscoverTools(c.Request().Context())
					if discErr == nil {
						for _, fn := range mcpFuncs {
							funcs = append(funcs, fn)
							input.Tools = append(input.Tools, functions.Tool{Type: "function", Function: fn})
						}
						shouldUseFn = len(funcs) > 0 && config.ShouldUseFunctions()
						xlog.Debug("MCP tools injected", "count", len(mcpFuncs), "total_funcs", len(funcs))
					} else {
						xlog.Error("Failed to discover MCP tools", "error", discErr)
					}
				}
			} else {
				xlog.Error("Failed to parse MCP config", "error", mcpErr)
			}
		}

		xlog.Debug("Tool call routing decision",
			"shouldUseFn", shouldUseFn,
			"len(input.Functions)", len(input.Functions),
			"len(input.Tools)", len(input.Tools),
			"config.ShouldUseFunctions()", config.ShouldUseFunctions(),
			"config.FunctionToCall()", config.FunctionToCall(),
		)

		for _, f := range input.Functions {
			if f.Strict {
				strictMode = true
				break
			}
		}

		// Allow the user to set custom actions via config file
		// to be "embedded" in each model
		noActionName := "answer"
		noActionDescription := "use this action to answer without performing any action"

		if config.FunctionsConfig.NoActionFunctionName != "" {
			noActionName = config.FunctionsConfig.NoActionFunctionName
		}
		if config.FunctionsConfig.NoActionDescriptionName != "" {
			noActionDescription = config.FunctionsConfig.NoActionDescriptionName
		}

		// If we are using a response format, we need to generate a grammar for it
		if config.ResponseFormatMap != nil {
			d := schema.ChatCompletionResponseFormat{}
			dat, err := json.Marshal(config.ResponseFormatMap)
			if err != nil {
				return err
			}
			err = json.Unmarshal(dat, &d)
			if err != nil {
				return err
			}

			switch d.Type {
			case "json_object":
				input.Grammar = functions.JSONBNF
			case "json_schema":
				d := schema.JsonSchemaRequest{}
				dat, err := json.Marshal(config.ResponseFormatMap)
				if err != nil {
					return err
				}
				err = json.Unmarshal(dat, &d)
				if err != nil {
					return err
				}
				fs := &functions.JSONFunctionStructure{
					AnyOf: []functions.Item{d.JsonSchema.Schema},
				}
				g, err := fs.Grammar(config.FunctionsConfig.GrammarOptions()...)
				if err == nil {
					input.Grammar = g
				} else {
					xlog.Error("Failed generating grammar", "error", err)
				}
			}
		}

		config.Grammar = input.Grammar

		if shouldUseFn {
			xlog.Debug("Response needs to process functions")
		}

		switch {
		// Generates grammar with LocalAI's internal engine
		case (!config.FunctionsConfig.GrammarConfig.NoGrammar || strictMode) && shouldUseFn:
			noActionGrammar := functions.Function{
				Name:        noActionName,
				Description: noActionDescription,
				Parameters: map[string]any{
					"properties": map[string]any{
						"message": map[string]any{
							"type":        "string",
							"description": "The message to reply the user with",
						}},
				},
			}

			// Append the no action function
			if !config.FunctionsConfig.DisableNoAction && !strictMode {
				funcs = append(funcs, noActionGrammar)
			}

			// Force picking one of the functions by the request
			if config.FunctionToCall() != "" {
				funcs = funcs.Select(config.FunctionToCall())
			}

			// Update input grammar or json_schema based on use_llama_grammar option
			jsStruct := funcs.ToJSONStructure(config.FunctionsConfig.FunctionNameKey, config.FunctionsConfig.FunctionNameKey)
			g, err := jsStruct.Grammar(config.FunctionsConfig.GrammarOptions()...)
			if err == nil {
				config.Grammar = g
			} else {
				xlog.Error("Failed generating grammar", "error", err)
			}
		case input.JSONFunctionGrammarObject != nil:
			g, err := input.JSONFunctionGrammarObject.Grammar(config.FunctionsConfig.GrammarOptions()...)
			if err == nil {
				config.Grammar = g
			} else {
				xlog.Error("Failed generating grammar", "error", err)
			}

		default:
			// Force picking one of the functions by the request
			if config.FunctionToCall() != "" {
				funcs = funcs.Select(config.FunctionToCall())
			}
		}

		// process functions if we have any defined or if we have a function call string

		// functions are not supported in stream mode (yet?)
		toStream := input.Stream

		xlog.Debug("Parameters", "config", config)

		var predInput string

		// If we are using the tokenizer template, we don't need to process the messages
		// unless we are processing functions
		if !config.TemplateConfig.UseTokenizerTemplate {
			predInput = evaluator.TemplateMessages(*input, input.Messages, config, funcs, shouldUseFn)

			xlog.Debug("Prompt (after templating)", "prompt", predInput)
			if config.Grammar != "" {
				xlog.Debug("Grammar", "grammar", config.Grammar)
			}
		}

		switch {
		case toStream:

			xlog.Debug("Stream request received")
			c.Response().Header().Set("Content-Type", "text/event-stream")
			c.Response().Header().Set("Cache-Control", "no-cache")
			c.Response().Header().Set("Connection", "keep-alive")
			c.Response().Header().Set("X-Correlation-ID", id)

			mcpStreamMaxIterations := 10
			if config.Agent.MaxIterations > 0 {
				mcpStreamMaxIterations = config.Agent.MaxIterations
			}
			hasMCPToolsStream := mcpExecutor != nil && mcpExecutor.HasTools()

			for mcpStreamIter := 0; mcpStreamIter <= mcpStreamMaxIterations; mcpStreamIter++ {
				// Re-template on MCP iterations
				if mcpStreamIter > 0 && !config.TemplateConfig.UseTokenizerTemplate {
					predInput = evaluator.TemplateMessages(*input, input.Messages, config, funcs, shouldUseFn)
					xlog.Debug("MCP stream re-templating", "iteration", mcpStreamIter)
				}

				responses := make(chan schema.OpenAIResponse)
				ended := make(chan error, 1)

				go func() {
					if !shouldUseFn {
						ended <- process(predInput, input, config, ml, responses, extraUsage, id, created)
					} else {
						ended <- processTools(noActionName, predInput, input, config, ml, responses, extraUsage, id, created, &textContentToReturn)
					}
				}()

				usage := &schema.OpenAIUsage{}
				toolsCalled := false
				var collectedToolCalls []schema.ToolCall
				var collectedContent string

			LOOP:
				for {
					select {
					case <-input.Context.Done():
						// Context was cancelled (client disconnected or request cancelled)
						xlog.Debug("Request context cancelled, stopping stream")
						input.Cancel()
						break LOOP
					case ev := <-responses:
						if len(ev.Choices) == 0 {
							xlog.Debug("No choices in the response, skipping")
							continue
						}
						usage = &ev.Usage // Copy a pointer to the latest usage chunk so that the stop message can reference it
						if len(ev.Choices[0].Delta.ToolCalls) > 0 {
							toolsCalled = true
							// Collect and merge tool call deltas for MCP execution
							if hasMCPToolsStream {
								collectedToolCalls = mergeToolCallDeltas(collectedToolCalls, ev.Choices[0].Delta.ToolCalls)
							}
						}
						// Collect content for MCP conversation history and automatic tool parsing fallback
						if (hasMCPToolsStream || config.FunctionsConfig.AutomaticToolParsingFallback) && ev.Choices[0].Delta != nil && ev.Choices[0].Delta.Content != nil {
							if s, ok := ev.Choices[0].Delta.Content.(string); ok {
								collectedContent += s
							} else if sp, ok := ev.Choices[0].Delta.Content.(*string); ok && sp != nil {
								collectedContent += *sp
							}
						}
						respData, err := json.Marshal(ev)
						if err != nil {
							xlog.Debug("Failed to marshal response", "error", err)
							input.Cancel()
							continue
						}
						xlog.Debug("Sending chunk", "chunk", string(respData))
						_, err = fmt.Fprintf(c.Response().Writer, "data: %s\n\n", string(respData))
						if err != nil {
							xlog.Debug("Sending chunk failed", "error", err)
							input.Cancel()
							return err
						}
						c.Response().Flush()
					case err := <-ended:
						if err == nil {
							break LOOP
						}
						xlog.Error("Stream ended with error", "error", err)

						errorResp := schema.ErrorResponse{
							Error: &schema.APIError{
								Message: err.Error(),
								Type:    "server_error",
								Code:    "server_error",
							},
						}
						respData, marshalErr := json.Marshal(errorResp)
						if marshalErr != nil {
							xlog.Error("Failed to marshal error response", "error", marshalErr)
							fmt.Fprintf(c.Response().Writer, "data: {\"error\":{\"message\":\"Internal error\",\"type\":\"server_error\"}}\n\n")
						} else {
							fmt.Fprintf(c.Response().Writer, "data: %s\n\n", respData)
						}
						fmt.Fprintf(c.Response().Writer, "data: [DONE]\n\n")
						c.Response().Flush()

						return nil
					}
				}

				// Drain responses channel to unblock the background goroutine if it's
				// still trying to send (e.g., after client disconnect). The goroutine
				// calls close(responses) when done, which terminates the drain.
				if input.Context.Err() != nil {
					go func() {
						for range responses {
						}
					}()
					<-ended
				}

				// MCP streaming tool execution: if we collected MCP tool calls, execute and loop
				if hasMCPToolsStream && toolsCalled && len(collectedToolCalls) > 0 {
					var hasMCPCalls bool
					for _, tc := range collectedToolCalls {
						if mcpExecutor != nil && mcpExecutor.IsTool(tc.FunctionCall.Name) {
							hasMCPCalls = true
							break
						}
					}
					if hasMCPCalls {
						// Append assistant message with tool_calls
						assistantMsg := schema.Message{
							Role:      "assistant",
							Content:   collectedContent,
							ToolCalls: collectedToolCalls,
						}
						input.Messages = append(input.Messages, assistantMsg)

						// Execute MCP tool calls and stream results as tool_result events
						for _, tc := range collectedToolCalls {
							if mcpExecutor == nil || !mcpExecutor.IsTool(tc.FunctionCall.Name) {
								continue
							}
							xlog.Debug("Executing MCP tool (stream)", "tool", tc.FunctionCall.Name, "iteration", mcpStreamIter)
							toolResult, toolErr := mcpExecutor.ExecuteTool(c.Request().Context(), tc.FunctionCall.Name, tc.FunctionCall.Arguments)
							if toolErr != nil {
								xlog.Error("MCP tool execution failed", "tool", tc.FunctionCall.Name, "error", toolErr)
								toolResult = fmt.Sprintf("Error: %v", toolErr)
							}
							input.Messages = append(input.Messages, schema.Message{
								Role:          "tool",
								Content:       toolResult,
								StringContent: toolResult,
								ToolCallID:    tc.ID,
								Name:          tc.FunctionCall.Name,
							})

							// Stream tool result event to client
							mcpEvent := map[string]any{
								"type":   "mcp_tool_result",
								"name":   tc.FunctionCall.Name,
								"result": toolResult,
							}
							if mcpEventData, err := json.Marshal(mcpEvent); err == nil {
								fmt.Fprintf(c.Response().Writer, "data: %s\n\n", mcpEventData)
								c.Response().Flush()
							}
						}

						xlog.Debug("MCP streaming tools executed, re-running inference", "iteration", mcpStreamIter)
						continue // next MCP stream iteration
					}
				}

				// Automatic tool parsing fallback for streaming: when no tools were
				// requested but the model emitted tool call markup, parse and emit them.
				if !shouldUseFn && config.FunctionsConfig.AutomaticToolParsingFallback && collectedContent != "" && !toolsCalled {
					parsed := functions.ParseFunctionCall(collectedContent, config.FunctionsConfig)
					for i, fc := range parsed {
						toolCallID := fc.ID
						if toolCallID == "" {
							toolCallID = id
						}
						toolCallMsg := schema.OpenAIResponse{
							ID:      id,
							Created: created,
							Model:   input.Model,
							Choices: []schema.Choice{{
								Delta: &schema.Message{
									Role: "assistant",
									ToolCalls: []schema.ToolCall{{
										Index: i,
										ID:    toolCallID,
										Type:  "function",
										FunctionCall: schema.FunctionCall{
											Name:      fc.Name,
											Arguments: fc.Arguments,
										},
									}},
								},
								Index: 0,
							}},
							Object: "chat.completion.chunk",
						}
						respData, _ := json.Marshal(toolCallMsg)
						fmt.Fprintf(c.Response().Writer, "data: %s\n\n", respData)
						c.Response().Flush()
						toolsCalled = true
					}
				}

				// No MCP tools to execute, send final stop message
				finishReason := FinishReasonStop
				if toolsCalled && len(input.Tools) > 0 {
					finishReason = FinishReasonToolCalls
				} else if toolsCalled {
					finishReason = FinishReasonFunctionCall
				}

				resp := &schema.OpenAIResponse{
					ID:      id,
					Created: created,
					Model:   input.Model, // we have to return what the user sent here, due to OpenAI spec.
					Choices: []schema.Choice{
						{
							FinishReason: &finishReason,
							Index:        0,
							Delta:        &schema.Message{},
						}},
					Object: "chat.completion.chunk",
					Usage:  *usage,
				}
				respData, _ := json.Marshal(resp)

				fmt.Fprintf(c.Response().Writer, "data: %s\n\n", respData)
				fmt.Fprintf(c.Response().Writer, "data: [DONE]\n\n")
				c.Response().Flush()
				xlog.Debug("Stream ended")
				return nil
			} // end MCP stream iteration loop

			// Safety fallback
			fmt.Fprintf(c.Response().Writer, "data: [DONE]\n\n")
			c.Response().Flush()
			return nil

// no streaming mode
|
|
default:
|
|
mcpMaxIterations := 10
|
|
if config.Agent.MaxIterations > 0 {
|
|
mcpMaxIterations = config.Agent.MaxIterations
|
|
}
|
|
hasMCPTools := mcpExecutor != nil && mcpExecutor.HasTools()
|
|
|
|
for mcpIteration := 0; mcpIteration <= mcpMaxIterations; mcpIteration++ {
|
|
// Re-template on each MCP iteration since messages may have changed
|
|
if mcpIteration > 0 && !config.TemplateConfig.UseTokenizerTemplate {
|
|
predInput = evaluator.TemplateMessages(*input, input.Messages, config, funcs, shouldUseFn)
|
|
xlog.Debug("MCP re-templating", "iteration", mcpIteration, "prompt_len", len(predInput))
|
|
}
|
|
|
|
// Detect if thinking token is already in prompt or template
|
|
var template string
|
|
if config.TemplateConfig.UseTokenizerTemplate {
|
|
template = config.GetModelTemplate() // TODO: this should be the parsed jinja template. But for now this is the best we can do.
|
|
} else {
|
|
template = predInput
|
|
}
|
|
thinkingStartToken := reason.DetectThinkingStartToken(template, &config.ReasoningConfig)
|
|
|
|
xlog.Debug("Thinking start token", "thinkingStartToken", thinkingStartToken, "template", template)
|
|
|
|
// When shouldUseFn, the callback just stores the raw text — tool parsing
|
|
// is deferred to after ComputeChoices so we can check chat deltas first
|
|
// and avoid redundant Go-side parsing.
|
|
var cbRawResult, cbReasoning string
|
|
|
|
tokenCallback := func(s string, c *[]schema.Choice) {
|
|
reasoning, s := reason.ExtractReasoningWithConfig(s, thinkingStartToken, config.ReasoningConfig)
|
|
|
|
if !shouldUseFn {
|
|
stopReason := FinishReasonStop
|
|
message := &schema.Message{Role: "assistant", Content: &s}
|
|
if reasoning != "" {
|
|
message.Reasoning = &reasoning
|
|
}
|
|
*c = append(*c, schema.Choice{FinishReason: &stopReason, Index: 0, Message: message})
|
|
return
|
|
}
|
|
|
|
// Store raw text for deferred tool parsing
|
|
cbRawResult = s
|
|
cbReasoning = reasoning
|
|
}
|
|
|
|
var result []schema.Choice
|
|
var tokenUsage backend.TokenUsage
|
|
var err error
|
|
|
|
var chatDeltas []*pb.ChatDelta
|
|
result, tokenUsage, chatDeltas, err = ComputeChoices(
|
|
input,
|
|
predInput,
|
|
config,
|
|
cl,
|
|
startupOptions,
|
|
ml,
|
|
tokenCallback,
|
|
nil,
|
|
func(attempt int) bool {
|
|
if !shouldUseFn {
|
|
return false
|
|
}
|
|
// Retry when backend produced only reasoning and no content/tool calls.
|
|
// Full tool parsing is deferred until after ComputeChoices returns
|
|
// (when chat deltas are available), but we can detect the empty case here.
|
|
if cbRawResult == "" && textContentToReturn == "" {
|
|
xlog.Warn("Backend produced reasoning without actionable content, retrying",
|
|
"reasoning_len", len(cbReasoning), "attempt", attempt+1)
|
|
cbRawResult = ""
|
|
cbReasoning = ""
|
|
textContentToReturn = ""
|
|
return true
|
|
}
|
|
return false
|
|
},
|
|
)
|
|
if err != nil {
|
|
return err
|
|
}

				// For non-tool requests: prefer C++ autoparser chat deltas over
				// Go-side tag extraction (which can mangle output when thinkingStartToken
				// differs from the model's actual reasoning tags, e.g. Gemma 4).
				if !shouldUseFn && len(chatDeltas) > 0 {
					deltaContent := functions.ContentFromChatDeltas(chatDeltas)
					deltaReasoning := functions.ReasoningFromChatDeltas(chatDeltas)
					if deltaContent != "" || deltaReasoning != "" {
						xlog.Debug("[ChatDeltas] non-SSE no-tools: overriding result with C++ autoparser deltas",
							"content_len", len(deltaContent), "reasoning_len", len(deltaReasoning))
						stopReason := FinishReasonStop
						message := &schema.Message{Role: "assistant", Content: &deltaContent}
						if deltaReasoning != "" {
							message.Reasoning = &deltaReasoning
						}
						newChoice := schema.Choice{FinishReason: &stopReason, Index: 0, Message: message}
						// Preserve logprobs from the original result
						if len(result) > 0 && result[0].Logprobs != nil {
							newChoice.Logprobs = result[0].Logprobs
						}
						result = []schema.Choice{newChoice}
					}
				}

				// Tool parsing is deferred here (only when shouldUseFn) so chat deltas are available
				if shouldUseFn {
					var funcResults []functions.FuncCallResults

					// Try pre-parsed tool calls from C++ autoparser first
					if deltaToolCalls := functions.ToolCallsFromChatDeltas(chatDeltas); len(deltaToolCalls) > 0 {
						xlog.Debug("[ChatDeltas] non-SSE: using C++ autoparser tool calls, skipping Go-side parsing", "count", len(deltaToolCalls))
						funcResults = deltaToolCalls
						textContentToReturn = functions.ContentFromChatDeltas(chatDeltas)
						cbReasoning = functions.ReasoningFromChatDeltas(chatDeltas)
					} else if deltaContent := functions.ContentFromChatDeltas(chatDeltas); len(chatDeltas) > 0 && deltaContent != "" {
						// ChatDeltas have content but no tool calls — model answered without using tools.
						// This happens with thinking models (e.g. Gemma 4) where the Go-side reasoning
						// extraction misclassifies clean content as reasoning, leaving cbRawResult empty.
						xlog.Debug("[ChatDeltas] non-SSE: using C++ autoparser content (no tool calls)", "content_len", len(deltaContent))
						textContentToReturn = deltaContent
						cbReasoning = functions.ReasoningFromChatDeltas(chatDeltas)
					} else {
						// Fallback: parse tool calls from raw text
						xlog.Debug("[ChatDeltas] non-SSE: no chat deltas, falling back to Go-side text parsing")
						textContentToReturn = functions.ParseTextContent(cbRawResult, config.FunctionsConfig)
						cbRawResult = functions.CleanupLLMResult(cbRawResult, config.FunctionsConfig)
						funcResults = functions.ParseFunctionCall(cbRawResult, config.FunctionsConfig)
					}

					// Content-based tool call fallback: if no tool calls were found,
					// try parsing the raw result — ParseFunctionCall handles detection internally.
					if len(funcResults) == 0 {
						contentFuncResults := functions.ParseFunctionCall(cbRawResult, config.FunctionsConfig)
						if len(contentFuncResults) > 0 {
							funcResults = contentFuncResults
							textContentToReturn = functions.StripToolCallMarkup(cbRawResult)
						}
					}
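
					// noActionName is the sentinel "no tool needed" function name; if the
					// model picked it, or produced no calls at all, the turn is answered
					// as plain text via handleQuestion.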
					noActionsToRun := (len(funcResults) > 0 && funcResults[0].Name == noActionName) || len(funcResults) == 0

					switch {
					case noActionsToRun:
						// Use textContentToReturn if available (e.g. from ChatDeltas),
						// otherwise fall back to cbRawResult for legacy Go-side parsing.
						questionInput := cbRawResult
						if textContentToReturn != "" {
							questionInput = textContentToReturn
						}
						qResult, qErr := handleQuestion(config, funcResults, questionInput, predInput)
						if qErr != nil {
							xlog.Error("error handling question", "error", qErr)
						}

						stopReason := FinishReasonStop
						message := &schema.Message{Role: "assistant", Content: &qResult}
						if cbReasoning != "" {
							message.Reasoning = &cbReasoning
						}
						result = append(result, schema.Choice{
							FinishReason: &stopReason,
							Message:      message,
						})
					default:
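						// At least one real tool call was parsed: emit tool_calls (modern
						// Tools API) or the deprecated function_call format, depending on
						// whether the request carried tools.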
						toolCallsReason := FinishReasonToolCalls
						toolChoice := schema.Choice{
							FinishReason: &toolCallsReason,
							Message: &schema.Message{
								Role: "assistant",
							},
						}
						if cbReasoning != "" {
							toolChoice.Message.Reasoning = &cbReasoning
						}

						for _, ss := range funcResults {
							name, args := ss.Name, ss.Arguments
							toolCallID := ss.ID
							if toolCallID == "" {
								toolCallID = id
							}
							if len(input.Tools) > 0 {
								toolChoice.Message.Content = textContentToReturn
								toolChoice.Message.ToolCalls = append(toolChoice.Message.ToolCalls,
									schema.ToolCall{
										ID:   toolCallID,
										Type: "function",
										FunctionCall: schema.FunctionCall{
											Name:      name,
											Arguments: args,
										},
									},
								)
							} else {
								// Deprecated function_call format
								functionCallReason := FinishReasonFunctionCall
								message := &schema.Message{
									Role:    "assistant",
									Content: &textContentToReturn,
									FunctionCall: map[string]any{
										"name":      name,
										"arguments": args,
									},
								}
								if cbReasoning != "" {
									message.Reasoning = &cbReasoning
								}
								result = append(result, schema.Choice{
									FinishReason: &functionCallReason,
									Message:      message,
								})
							}
						}
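
						// With the Tools API, all parsed calls were accumulated on the
						// single toolChoice, which is appended once here; the legacy path
						// above already appended one choice per call.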
						if len(input.Tools) > 0 {
							result = append(result, toolChoice)
						}
					}
				}

				// Automatic tool parsing fallback: when no tools/functions were in the
				// request but the model emitted tool call markup, parse and surface them.
				if !shouldUseFn && config.FunctionsConfig.AutomaticToolParsingFallback && len(result) > 0 {
					for i, choice := range result {
						if choice.Message == nil || choice.Message.Content == nil {
							continue
						}
						contentStr, ok := choice.Message.Content.(string)
						if !ok || contentStr == "" {
							continue
						}
						parsed := functions.ParseFunctionCall(contentStr, config.FunctionsConfig)
						if len(parsed) == 0 {
							continue
						}
						stripped := functions.StripToolCallMarkup(contentStr)
						toolCallsReason := FinishReasonToolCalls
						result[i].FinishReason = &toolCallsReason
						if stripped != "" {
							result[i].Message.Content = &stripped
						} else {
							result[i].Message.Content = nil
						}
						for _, fc := range parsed {
							toolCallID := fc.ID
							if toolCallID == "" {
								toolCallID = id
							}
							result[i].Message.ToolCalls = append(result[i].Message.ToolCalls,
								schema.ToolCall{
									ID:   toolCallID,
									Type: "function",
									FunctionCall: schema.FunctionCall{
										Name:      fc.Name,
										Arguments: fc.Arguments,
									},
								},
							)
						}
					}
				}

				// MCP server-side tool execution loop:
				// If we have MCP tools and the model returned tool_calls, execute MCP tools
				// and re-run inference with the results appended to the conversation.
				if hasMCPTools && len(result) > 0 {
					var mcpCallsExecuted bool
					for _, choice := range result {
						if choice.Message == nil || len(choice.Message.ToolCalls) == 0 {
							continue
						}
						// Check if any tool calls are MCP tools
						var hasMCPCalls bool
						for _, tc := range choice.Message.ToolCalls {
							if mcpExecutor != nil && mcpExecutor.IsTool(tc.FunctionCall.Name) {
								hasMCPCalls = true
								break
							}
						}
						if !hasMCPCalls {
							continue
						}

						// Append assistant message with tool_calls to conversation
						assistantContent := ""
						if choice.Message.Content != nil {
							if s, ok := choice.Message.Content.(string); ok {
								assistantContent = s
							} else if sp, ok := choice.Message.Content.(*string); ok && sp != nil {
								assistantContent = *sp
							}
						}
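						// (Content may arrive as a string or a *string; both assertions
						// above handle the two shapes.)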
						assistantMsg := schema.Message{
							Role:      "assistant",
							Content:   assistantContent,
							ToolCalls: choice.Message.ToolCalls,
						}
						input.Messages = append(input.Messages, assistantMsg)

						// Execute each MCP tool call and append results
						for _, tc := range choice.Message.ToolCalls {
							if mcpExecutor == nil || !mcpExecutor.IsTool(tc.FunctionCall.Name) {
								continue
							}
							xlog.Debug("Executing MCP tool", "tool", tc.FunctionCall.Name, "arguments", tc.FunctionCall.Arguments, "iteration", mcpIteration)
							toolResult, toolErr := mcpExecutor.ExecuteTool(c.Request().Context(), tc.FunctionCall.Name, tc.FunctionCall.Arguments)
							if toolErr != nil {
								xlog.Error("MCP tool execution failed", "tool", tc.FunctionCall.Name, "error", toolErr)
								toolResult = fmt.Sprintf("Error: %v", toolErr)
							}
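							// Execution errors are surfaced to the model as the tool's
							// output instead of failing the request, so it can react or retry.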
							input.Messages = append(input.Messages, schema.Message{
								Role:          "tool",
								Content:       toolResult,
								StringContent: toolResult,
								ToolCallID:    tc.ID,
								Name:          tc.FunctionCall.Name,
							})
							mcpCallsExecuted = true
						}
					}

					if mcpCallsExecuted {
						xlog.Debug("MCP tools executed, re-running inference", "iteration", mcpIteration, "messages", len(input.Messages))
						continue // next MCP iteration
					}
				}

				// No MCP tools to execute (or no MCP tools configured), return response
				usage := schema.OpenAIUsage{
					PromptTokens:     tokenUsage.Prompt,
					CompletionTokens: tokenUsage.Completion,
					TotalTokens:      tokenUsage.Prompt + tokenUsage.Completion,
				}
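				// Timing metrics are only attached when the client asked for extra usage.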
				if extraUsage {
					usage.TimingTokenGeneration = tokenUsage.TimingTokenGeneration
					usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
				}

				resp := &schema.OpenAIResponse{
					ID:      id,
					Created: created,
					Model:   input.Model, // we have to return what the user sent here, due to OpenAI spec.
					Choices: result,
					Object:  "chat.completion",
					Usage:   usage,
				}
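				// Marshal once for debug logging; the error is ignored since this JSON
				// is log-only.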
				respData, _ := json.Marshal(resp)
				xlog.Debug("Response", "response", string(respData))

				// Return the prediction in the response body
				return c.JSON(200, resp)
			} // end MCP iteration loop

			// Should not reach here, but safety fallback
			return fmt.Errorf("MCP iteration limit reached")
		}
	}
}

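// handleQuestion resolves a plain-text reply when no actionable tool call was
// produced: it returns the raw result when there were no function results at
// all, otherwise it tries to salvage a "message" field from the first call's
// JSON arguments, and finally falls back to an empty reply.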
func handleQuestion(config *config.ModelConfig, funcResults []functions.FuncCallResults, result, prompt string) (string, error) {

	if len(funcResults) == 0 && result != "" {
		xlog.Debug("no function results, but we had a message from the LLM")

		return result, nil
	}

	xlog.Debug("nothing to do, computing a reply")
	arg := ""
	if len(funcResults) > 0 {
		arg = funcResults[0].Arguments
	}
	// If there is a message that the LLM already sends as part of the JSON reply, use it
	arguments := map[string]any{}
	if err := json.Unmarshal([]byte(arg), &arguments); err != nil {
		xlog.Debug("handleQuestion: function result did not contain a valid JSON object")
	}
	m, exists := arguments["message"]
	if exists {
		switch message := m.(type) {
		case string:
			if message != "" {
				xlog.Debug("Reply received from LLM", "message", message)
				message = backend.Finetune(*config, prompt, message)
				xlog.Debug("Reply received from LLM (finetuned)", "message", message)

				return message, nil
			}
		}
	}

	xlog.Debug("No action received from the LLM and no usable message, returning an empty reply")

	return "", nil
}