feat: generic chat_template_kwargs (model config + per-request metadata) (#10359)

* feat(config): add chat_template_kwargs model field + resolver

Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier,
plus ResolveChatTemplateKwargs which layers the config map under coerced
request metadata. Foundation for generic jinja chat-template kwargs (issue #10329).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): forward resolved chat_template_kwargs blob to backends

gRPCPredictOpts now merges per-request client metadata over the server-derived
enable_thinking/reasoning_effort (reaching all backends via the standalone keys)
and serialises the resolved chat_template_kwargs map into a JSON blob for
llama.cpp, written last so a client cannot clobber it. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): wire request metadata to config.RequestMetadata

The OpenAI request metadata field was parsed but unused; stamp it onto the
per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs
overrides. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks)

Replace the per-key enable_thinking/reasoning_effort handling in both the
streaming and non-streaming chat paths with a single block that parses the
chat_template_kwargs JSON blob resolved by the Go layer and merges every key
into body_json. New jinja template levers (e.g. preserve_thinking) now need
no C++ change. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: document custom chat_template_kwargs (model + per-request)

Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob

Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC

Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the
received PredictOptions.Metadata, and an app_test.go spec that drives a real
/v1/chat/completions request (model chat_template_kwargs + per-request metadata
override) and asserts the exact metadata + chat_template_kwargs blob the REST
layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(config): grandfather chat_template_kwargs in registry coverage

chat_template_kwargs is a free-form map[string]any (like engine_args, already
on the list), not a scalar the config UI registry can surface, so it is exempt
from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries
failure introduced by the new field. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI [bot]
2026-06-16 12:16:34 +02:00
committed by GitHub
parent f44034021e
commit 1ab61a0875
11 changed files with 396 additions and 34 deletions
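
The precedence described in the commit message (model config map, then server-derived levers, then per-request client metadata, with the blob written last) can be illustrated with a small self-contained Go sketch. The names `coerce` and `resolveMetadata` are hypothetical and used only for illustration; the actual logic lives in `gRPCPredictOpts` and `ResolveChatTemplateKwargs` in the diff that follows, and this sketch assumes nothing beyond what those hunks show.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// coerce turns "true"/"false" into real booleans and leaves every other
// string untouched, matching the coercion rule stated in the commit.
func coerce(v string) any {
	switch v {
	case "true":
		return true
	case "false":
		return false
	default:
		return v
	}
}

// resolveMetadata (hypothetical name) layers the three sources:
// model kwargs < server-derived levers < client request metadata.
// The reserved "chat_template_kwargs" key from the client is skipped and the
// serialised blob is written last, so a client cannot replace it.
func resolveMetadata(modelKwargs map[string]any, serverLevers, clientMeta map[string]string) (map[string]string, error) {
	const reserved = "chat_template_kwargs"

	metadata := map[string]string{}
	for k, v := range serverLevers {
		metadata[k] = v
	}
	for k, v := range clientMeta {
		if k == reserved {
			continue
		}
		metadata[k] = v
	}

	blob := map[string]any{}
	for k, v := range modelKwargs {
		blob[k] = v
	}
	for k, v := range metadata {
		blob[k] = coerce(v)
	}
	if len(blob) == 0 {
		return metadata, nil
	}

	b, err := json.Marshal(blob)
	if err != nil {
		return nil, err
	}
	metadata[reserved] = string(b)
	return metadata, nil
}

func main() {
	meta, err := resolveMetadata(
		map[string]any{"preserve_thinking": true},     // model YAML
		map[string]string{"enable_thinking": "false"}, // server-derived
		map[string]string{ // client request metadata
			"enable_thinking":      "true",
			"reasoning_effort":     "low",
			"chat_template_kwargs": `{"preserve_thinking":false}`,
		},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(meta["enable_thinking"])
	// prints: true
	fmt.Println(meta["chat_template_kwargs"])
	// prints: {"enable_thinking":true,"preserve_thinking":true,"reasoning_effort":"low"}
}
```

The standalone `enable_thinking` key stays a string (for backends that read it directly), while the blob carries the coerced boolean. The client's attempt to pass its own `chat_template_kwargs` value has no effect on the blob.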

@@ -1922,25 +1922,27 @@ public:
body_json["min_p"] = data["min_p"];
}
// Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
// Forward the chat_template_kwargs the Go layer resolved (model config
// chat_template_kwargs + per-request metadata: enable_thinking,
// reasoning_effort, preserve_thinking, ...). One generic merge replaces
// the previous per-key handling - new template levers need no C++ change.
// oaicompat_chat_params_parse reads these from body_json.
const auto& metadata = request->metadata();
auto et_it = metadata.find("enable_thinking");
if (et_it != metadata.end()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
auto ctk_it = metadata.find("chat_template_kwargs");
if (ctk_it != metadata.end() && !ctk_it->second.empty()) {
try {
json ctk = json::parse(ctk_it->second);
if (ctk.is_object()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
for (auto& el : ctk.items()) {
body_json["chat_template_kwargs"][el.key()] = el.value();
}
}
} catch (const std::exception & e) {
SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
}
body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
}
// Pass reasoning_effort via chat_template_kwargs too: the lever
// jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
// from enable_thinking which those templates ignore.
auto re_it = metadata.find("reasoning_effort");
if (re_it != metadata.end() && !re_it->second.empty()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
}
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -2756,25 +2758,26 @@ public:
body_json["min_p"] = data["min_p"];
}
// Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
// Forward the chat_template_kwargs the Go layer resolved (model config
// chat_template_kwargs + per-request metadata: enable_thinking,
// reasoning_effort, preserve_thinking, ...). One generic merge replaces
// the previous per-key handling - new template levers need no C++ change.
const auto& predict_metadata = request->metadata();
auto predict_et_it = predict_metadata.find("enable_thinking");
if (predict_et_it != predict_metadata.end()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
auto predict_ctk_it = predict_metadata.find("chat_template_kwargs");
if (predict_ctk_it != predict_metadata.end() && !predict_ctk_it->second.empty()) {
try {
json ctk = json::parse(predict_ctk_it->second);
if (ctk.is_object()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
for (auto& el : ctk.items()) {
body_json["chat_template_kwargs"][el.key()] = el.value();
}
}
} catch (const std::exception & e) {
SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
}
body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
}
// Pass reasoning_effort via chat_template_kwargs too: the lever
// jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
// from enable_thinking which those templates ignore.
auto predict_re_it = predict_metadata.find("reasoning_effort");
if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
if (!body_json.contains("chat_template_kwargs")) {
body_json["chat_template_kwargs"] = json::object();
}
body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
}
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
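
For readers more comfortable with Go than C++, the generic merge performed in both chat paths above can be sketched as follows. This is an illustrative analogue, not the backend code: `mergeChatTemplateKwargs` is a hypothetical helper, the request body is modelled as a plain `map[string]any`, and a parse failure is returned as an error where the C++ logs a warning and carries on. One deliberate difference: if the body already holds a non-object `chat_template_kwargs`, this sketch replaces it with a fresh object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeChatTemplateKwargs (hypothetical) parses the JSON blob carried in the
// gRPC metadata and copies every key into body["chat_template_kwargs"].
// Keys from the blob win over keys already present in the body.
func mergeChatTemplateKwargs(body map[string]any, blob string) error {
	if blob == "" {
		return nil // nothing was forwarded
	}
	var parsed any
	if err := json.Unmarshal([]byte(blob), &parsed); err != nil {
		return fmt.Errorf("failed to parse chat_template_kwargs metadata: %w", err)
	}
	kwargs, ok := parsed.(map[string]any)
	if !ok {
		return nil // valid JSON but not an object: ignored
	}
	existing, ok := body["chat_template_kwargs"].(map[string]any)
	if !ok {
		existing = map[string]any{}
		body["chat_template_kwargs"] = existing
	}
	for k, v := range kwargs {
		existing[k] = v
	}
	return nil
}

func main() {
	body := map[string]any{
		"chat_template_kwargs": map[string]any{"enable_thinking": true},
	}
	blob := `{"enable_thinking":false,"preserve_thinking":true,"reasoning_effort":"low"}`
	if err := mergeChatTemplateKwargs(body, blob); err != nil {
		fmt.Println("warning:", err)
	}
	out, _ := json.Marshal(body)
	fmt.Println(string(out))
}
```

Because the loop copies whatever keys the blob contains, a new template variable only has to appear in the model config or request metadata; the merge itself does not need to know its name.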

@@ -368,6 +368,25 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
if c.ReasoningEffort != "" {
metadata["reasoning_effort"] = c.ReasoningEffort
}
// Client request metadata overrides the server-derived reasoning levers and
// reaches every backend through these standalone string keys (Python backends
// read them directly). The reserved blob key is server-owned and skipped.
for k, v := range c.RequestMetadata {
if k == "chat_template_kwargs" {
continue
}
metadata[k] = v
}
// Build the generic chat_template_kwargs blob (model config map + coerced
// metadata) for llama.cpp and write it LAST so a client cannot clobber it.
if blob := c.ResolveChatTemplateKwargs(metadata); len(blob) > 0 {
b, err := json.Marshal(blob)
if err != nil {
xlog.Warn("failed to marshal chat_template_kwargs", "error", err)
} else {
metadata["chat_template_kwargs"] = string(b)
}
}
pbOpts.Metadata = metadata
// Logprobs and TopLogprobs are set by the caller if provided

@@ -161,3 +161,67 @@ var _ = Describe("grpcModelOpts NBatch", func() {
Expect(opts.ContextSize).To(BeEquivalentTo(4096), "n_batch must match the effective n_ctx the backend receives")
})
})
// Guards the generic chat_template_kwargs forwarding: the model config map plus any
// per-request metadata overrides are merged, coerced, and serialised into the
// backend metadata blob that llama.cpp reads. Client metadata also overrides the
// server-derived standalone enable_thinking key (cross-backend consistency).
var _ = Describe("gRPCPredictOpts chat_template_kwargs metadata", func() {
baseCfg := func() config.ModelConfig {
cfg := config.ModelConfig{}
cfg.SetDefaults()
return cfg
}
It("serialises the config map into the chat_template_kwargs blob", func() {
cfg := baseCfg()
cfg.ChatTemplateKwargs = map[string]any{"preserve_thinking": true}
opts := gRPCPredictOpts(cfg, "/tmp/models")
Expect(opts.Metadata).To(HaveKey("chat_template_kwargs"))
var blob map[string]any
Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
Expect(blob).To(HaveKeyWithValue("preserve_thinking", true))
})
It("serialises reasoning_effort into the blob as a JSON string", func() {
cfg := baseCfg()
cfg.ReasoningEffort = "high"
opts := gRPCPredictOpts(cfg, "/tmp/models")
Expect(opts.Metadata).To(HaveKey("chat_template_kwargs"))
var blob map[string]any
Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
// reasoning_effort must remain a string in the blob (jinja templates that
// key on the level read a string), unlike enable_thinking which is a bool.
Expect(blob["reasoning_effort"]).To(BeAssignableToTypeOf(""))
Expect(blob).To(HaveKeyWithValue("reasoning_effort", "high"))
})
It("lets client request metadata override the server-derived enable_thinking key", func() {
cfg := baseCfg()
disable := true
cfg.ReasoningConfig = reasoning.Config{DisableReasoning: &disable} // server: enable_thinking=false
cfg.RequestMetadata = map[string]string{"enable_thinking": "true"} // client overrides
opts := gRPCPredictOpts(cfg, "/tmp/models")
// standalone key (Python backends) reflects the client override
Expect(opts.Metadata).To(HaveKeyWithValue("enable_thinking", "true"))
// blob (llama.cpp) reflects it too, as a real bool
var blob map[string]any
Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
Expect(blob).To(HaveKeyWithValue("enable_thinking", true))
})
It("does not let a client clobber the blob via a chat_template_kwargs metadata key", func() {
cfg := baseCfg()
cfg.ChatTemplateKwargs = map[string]any{"preserve_thinking": true}
cfg.RequestMetadata = map[string]string{"chat_template_kwargs": "{\"preserve_thinking\": false}"}
opts := gRPCPredictOpts(cfg, "/tmp/models")
var blob map[string]any
Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
Expect(blob).To(HaveKeyWithValue("preserve_thinking", true))
})
It("omits the blob when there is nothing to forward", func() {
opts := gRPCPredictOpts(baseCfg(), "/tmp/models")
Expect(opts.Metadata).ToNot(HaveKey("chat_template_kwargs"))
})
})

@@ -0,0 +1,48 @@
package config_test
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
)
// ResolveChatTemplateKwargs layers the model config map (base) under the coerced
// backend metadata (server reasoning levers + client request overrides).
var _ = Describe("ModelConfig.ResolveChatTemplateKwargs", func() {
It("returns nil when nothing is set", func() {
c := &config.ModelConfig{}
Expect(c.ResolveChatTemplateKwargs(nil)).To(BeNil())
})
It("returns the config map when no metadata is present", func() {
c := &config.ModelConfig{ChatTemplateKwargs: map[string]any{"preserve_thinking": true}}
Expect(c.ResolveChatTemplateKwargs(nil)).To(HaveKeyWithValue("preserve_thinking", true))
})
It("lets metadata override the config map", func() {
c := &config.ModelConfig{ChatTemplateKwargs: map[string]any{"enable_thinking": true}}
got := c.ResolveChatTemplateKwargs(map[string]string{"enable_thinking": "false"})
Expect(got).To(HaveKeyWithValue("enable_thinking", false))
})
It("coerces true/false to bool and leaves other strings as-is", func() {
c := &config.ModelConfig{}
got := c.ResolveChatTemplateKwargs(map[string]string{
"enable_thinking": "true",
"reasoning_effort": "high",
})
Expect(got).To(HaveKeyWithValue("enable_thinking", true))
Expect(got).To(HaveKeyWithValue("reasoning_effort", "high"))
})
It("skips the reserved chat_template_kwargs metadata key but keeps siblings", func() {
c := &config.ModelConfig{}
got := c.ResolveChatTemplateKwargs(map[string]string{
"chat_template_kwargs": "{\"x\":1}",
"preserve_thinking": "true",
})
Expect(got).ToNot(HaveKey("chat_template_kwargs"))
Expect(got).To(HaveKeyWithValue("preserve_thinking", true))
})
})

@@ -112,6 +112,7 @@ var grandfatheredUnregistered = []string{
"agent.max_attempts",
"agent.max_iterations",
"cfg_scale",
"chat_template_kwargs",
"concurrency_groups",
"cutstrings",
"debug",

@@ -70,6 +70,19 @@ type ModelConfig struct {
// (Harmony) or LFM2.5 — honor it; "none" also toggles enable_thinking off.
ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`
// ChatTemplateKwargs are arbitrary key/values forwarded to the backend's jinja
// chat template via chat_template_kwargs (e.g. preserve_thinking: true). The
// server-derived reasoning levers (enable_thinking / reasoning_effort) and any
// per-request metadata overrides layer on top. See gRPCPredictOpts.
ChatTemplateKwargs map[string]any `yaml:"chat_template_kwargs,omitempty" json:"chat_template_kwargs,omitempty"`
// RequestMetadata holds the raw client request `metadata` map for the current
// request. The request middleware stamps it; gRPCPredictOpts merges it into the
// backend gRPC metadata (overriding the server-derived enable_thinking /
// reasoning_effort) and folds it, coerced, into the chat_template_kwargs blob.
// Never persisted to YAML.
RequestMetadata map[string]string `yaml:"-" json:"-"`
FeatureFlag FeatureFlag `yaml:"feature_flags,omitempty" json:"feature_flags,omitempty"` // Feature Flag registry. We move fast, and features may break on a per model/backend basis. Registry for (usually temporary) flags that indicate aborting something early.
// LLM configs (GPT4ALL, Llama.cpp, ...)
LLMConfig `yaml:",inline" json:",inline"`
@@ -551,6 +564,44 @@ func (c *ModelConfig) ApplyReasoningEffort(requestEffort string) {
}
}
// coerceChatTemplateKwarg coerces a request-metadata string value for use as a
// jinja chat_template_kwarg. "true"/"false" become real booleans (so a jinja
// `{% if preserve_thinking %}` reads false correctly, since any non-empty string
// is truthy); everything else stays a string. Numeric/typed per-request values are
// out of scope - set those in the model YAML chat_template_kwargs (YAML keeps the type).
func coerceChatTemplateKwarg(v string) any {
switch v {
case "true":
return true
case "false":
return false
default:
return v
}
}
// ResolveChatTemplateKwargs builds the final chat_template_kwargs map forwarded to
// the backend, layered: the model config map (base) < the coerced backend metadata
// (server reasoning levers + client request overrides). `meta` is the already-merged
// backend metadata string map. The reserved "chat_template_kwargs" key is skipped so
// a client cannot smuggle a nested blob. Returns nil when there is nothing to forward.
func (c *ModelConfig) ResolveChatTemplateKwargs(meta map[string]string) map[string]any {
out := map[string]any{}
for k, v := range c.ChatTemplateKwargs {
out[k] = v
}
for k, v := range meta {
if k == "chat_template_kwargs" {
continue
}
out[k] = coerceChatTemplateKwarg(v)
}
if len(out) == 0 {
return nil
}
return out
}
// @Description PipelineStreaming toggles incremental delivery per realtime stage.
type PipelineStreaming struct {
LLM *bool `yaml:"llm,omitempty" json:"llm,omitempty"`

@@ -735,6 +735,18 @@ parameters:
`
Expect(os.WriteFile(filepath.Join(modelDir, "mock-model.yaml"), []byte(mockModelYAML), 0644)).To(Succeed())
// A second model carrying chat_template_kwargs so the REST->gRPC
// metadata-forwarding spec below can assert the model-YAML kwarg is
// merged with the per-request override.
mockCTKModelYAML := `name: mock-ctk-model
backend: mock-backend
parameters:
  model: mock-model.bin
chat_template_kwargs:
  preserve_thinking: true
`
Expect(os.WriteFile(filepath.Join(modelDir, "mock-ctk-model.yaml"), []byte(mockCTKModelYAML), 0644)).To(Succeed())
systemState, err := system.GetSystemState(
system.WithBackendPath(backendDir),
system.WithModelPath(modelDir),
@@ -809,6 +821,59 @@ parameters:
Expect(string(dat)).To(ContainSubstring("mock-backend"))
})
It("forwards chat_template_kwargs and reasoning levers to gRPC PredictOptions.Metadata", func() {
// True HTTP->gRPC contract guard: drive a real /v1/chat/completions
// request and assert the exact metadata the REST layer forwarded to
// the backend. The mock-backend echoes PredictOptions.Metadata as JSON
// when it sees the ECHO_PREDICT_METADATA marker in the prompt, so this
// pins the request->gRPC mapping (model-YAML chat_template_kwargs +
// per-request metadata override + type coercion + standalone keys)
// without adding a new RPC. The marker rides in the user content and
// must survive into the backend prompt; if a future default chat
// template drops raw user content, move the marker to /v1/completions.
reqBody := map[string]any{
"model": "mock-ctk-model",
"messages": []map[string]any{
{"role": "user", "content": "ECHO_PREDICT_METADATA"},
},
// per-request override: overrides the standalone enable_thinking key
// and exercises coercion ("false" -> bool, "low" -> string) in the blob
"metadata": map[string]string{
"enable_thinking": "false",
"reasoning_effort": "low",
},
}
var chatResp struct {
Choices []struct {
Message struct {
Content string `json:"content"`
} `json:"message"`
} `json:"choices"`
}
err := postRequestResponseJSON("http://127.0.0.1:9090/v1/chat/completions", &reqBody, &chatResp)
Expect(err).ToNot(HaveOccurred())
Expect(chatResp.Choices).ToNot(BeEmpty())
// The assistant content is the JSON snapshot of PredictOptions.Metadata.
var meta map[string]string
Expect(json.Unmarshal([]byte(chatResp.Choices[0].Message.Content), &meta)).To(Succeed(), "echoed metadata: %s", chatResp.Choices[0].Message.Content)
// Standalone keys reflect the per-request override (consumed by Python
// backends; consistent across backends).
Expect(meta).To(HaveKeyWithValue("enable_thinking", "false"))
Expect(meta).To(HaveKeyWithValue("reasoning_effort", "low"))
// The chat_template_kwargs blob (consumed by llama.cpp) merges the
// model-YAML kwarg with the coerced request metadata override.
Expect(meta).To(HaveKey("chat_template_kwargs"))
var ctk map[string]any
Expect(json.Unmarshal([]byte(meta["chat_template_kwargs"]), &ctk)).To(Succeed(), "chat_template_kwargs blob: %s", meta["chat_template_kwargs"])
Expect(ctk).To(HaveKeyWithValue("preserve_thinking", true)) // bool from model YAML
Expect(ctk).To(HaveKeyWithValue("enable_thinking", false)) // coerced "false" -> bool
Expect(ctk).To(HaveKeyWithValue("reasoning_effort", "low")) // non-bool stays string
})
// Agent Jobs: HTTP API for task/job scheduling. The underlying AgentPool
// service is exercised in core/services/agentpool/agent_jobs_test.go;
// these specs cover the /api/agent/* HTTP plumbing on top.

@@ -318,6 +318,13 @@ func mergeOpenAIRequestAndModelConfig(config *config.ModelConfig, input *schema.
// (an operator's explicit disable wins over a request asking to think).
config.ApplyReasoningEffort(input.ReasoningEffort)
// Forward the client's request metadata so chat-template kwargs set per-request
// (enable_thinking, reasoning_effort, preserve_thinking, ...) reach the backend
// and override the model's reasoning-config defaults. See gRPCPredictOpts.
if len(input.Metadata) > 0 {
config.RequestMetadata = input.Metadata
}
// Collapse the modern max_completion_tokens alias into the
// legacy Maxtokens field so downstream code reads exactly one.
// MaxCompletionTokens wins on conflict — it's the canonical

@@ -731,3 +731,60 @@ var _ = Describe("SetModelAndConfig reasoning_effort parsing (chat completions)"
Expect(*(*captured2).ReasoningConfig.DisableReasoning).To(BeFalse())
})
})
var _ = Describe("SetModelAndConfig metadata passthrough (chat completions)", func() {
var modelDir string
BeforeEach(func() {
var err error
modelDir, err = os.MkdirTemp("", "localai-test-models-*")
Expect(err).ToNot(HaveOccurred())
})
AfterEach(func() { _ = os.RemoveAll(modelDir) })
buildApp := func() (*echo.Echo, **config.ModelConfig) {
Expect(os.WriteFile(filepath.Join(modelDir, "test-model.yaml"),
[]byte("name: test-model\nbackend: llama\n"), 0644)).To(Succeed())
ss := &system.SystemState{Model: system.Model{ModelsPath: modelDir}}
appConfig := config.NewApplicationConfig()
appConfig.SystemState = ss
mcl := config.NewModelConfigLoader(modelDir)
ml := model.NewModelLoader(ss)
re := NewRequestExtractor(mcl, ml, appConfig)
captured := new(*config.ModelConfig)
app := echo.New()
app.POST("/v1/chat/completions",
func(c echo.Context) error {
if cfg, ok := c.Get(CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig); ok {
*captured = cfg
}
return c.String(http.StatusOK, "ok")
},
re.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.OpenAIRequest) }),
func(next echo.HandlerFunc) echo.HandlerFunc {
return func(c echo.Context) error {
if err := re.SetOpenAIRequest(c); err != nil {
return err
}
return next(c)
}
},
)
return app, captured
}
It("stamps request metadata onto the config", func() {
app, captured := buildApp()
body := `{"model":"test-model","messages":[{"role":"user","content":"hi"}],` +
`"metadata":{"preserve_thinking":"true"}}`
req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(body))
req.Header.Set("Content-Type", "application/json")
rec := httptest.NewRecorder()
app.ServeHTTP(rec, req)
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(*captured).ToNot(BeNil())
Expect((*captured).RequestMetadata).To(HaveKeyWithValue("preserve_thinking", "true"))
})
})

@@ -438,6 +438,36 @@ pipeline:
reasoning_effort: none # overrides the LLM model's own reasoning_effort
```
#### Custom `chat_template_kwargs`
Some jinja chat templates expose extra variables beyond `enable_thinking` /
`reasoning_effort` (for example Qwen3's `preserve_thinking`). Set arbitrary key/values in
the model config and they are forwarded to the backend's `chat_template_kwargs` as-is, so
you don't need a dedicated server option per template variable:
```yaml
name: qwen3
chat_template_kwargs:
  preserve_thinking: true
```
You can also override (or add) any of these per request through the OpenAI `metadata`
field on `/v1/chat/completions`. Values are strings; `"true"` / `"false"` are coerced to
booleans, anything else is passed through as a string:
```json
{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "hi"}],
  "metadata": { "preserve_thinking": "true", "enable_thinking": "false" }
}
```
Per-request `metadata` overrides the model config defaults and the reasoning-config levers,
and (for `enable_thinking` / `reasoning_effort`) takes effect across every backend that
reads them, not just llama.cpp. Typed (non-boolean) values are only supported through the
model YAML `chat_template_kwargs`, where YAML preserves the type.
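
As a rough client-side illustration, the request above could be built from Go like this. The struct names, the `newChatRequest` helper, and the `http://localhost:8080` base URL are assumptions made for the example; only the `/v1/chat/completions` path and the string-valued `metadata` field come from the text above. The sketch constructs the request without sending it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage and chatRequest are minimal hypothetical types covering only
// the fields used in the example request.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string            `json:"model"`
	Messages []chatMessage     `json:"messages"`
	Metadata map[string]string `json:"metadata,omitempty"` // values must be strings
}

// encodeChatRequest serialises the request body to JSON.
func encodeChatRequest(body chatRequest) ([]byte, error) {
	return json.Marshal(body)
}

// newChatRequest builds (but does not send) the POST request.
func newChatRequest(baseURL string, body chatRequest) (*http.Request, error) {
	payload, err := encodeChatRequest(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	body := chatRequest{
		Model:    "qwen3",
		Messages: []chatMessage{{Role: "user", Content: "hi"}},
		Metadata: map[string]string{
			"preserve_thinking": "true",  // coerced to a boolean by the server
			"enable_thinking":   "false", // coerced to a boolean by the server
		},
	}
	req, err := newChatRequest("http://localhost:8080", body) // assumed address
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Passing the resulting request to `http.DefaultClient.Do` would send it. Because `metadata` is a string-to-string map, a numeric kwarg sent this way arrives as a string, which is why typed values belong in the model YAML.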
### Multimodal Backend Options
| Option | Type | Default | Description |

@@ -109,6 +109,23 @@ func (m *MockBackend) Predict(ctx context.Context, in *pb.PredictOptions) (*pb.R
}, nil
}
// ECHO_PREDICT_METADATA lets tests assert exactly what the REST layer
// forwarded to the backend as gRPC PredictOptions.Metadata (e.g. the
// chat_template_kwargs blob and the standalone enable_thinking/reasoning_effort
// keys). The reply carries a JSON snapshot of in.Metadata so an HTTP-level
// test can pin the request -> gRPC mapping without a new RPC.
if strings.Contains(in.Prompt, "ECHO_PREDICT_METADATA") {
payload, err := json.Marshal(in.Metadata)
if err != nil {
return nil, fmt.Errorf("mock backend echo metadata error: %w", err)
}
return &pb.Reply{
Message: payload,
Tokens: int32(len(in.Metadata)),
PromptTokens: 1,
}, nil
}
// ECHO_SERVED_MODEL returns the loaded model file path so router e2e
// tests can verify which candidate actually served the request without
// adding a new RPC. The router fans out to a single backend process per