mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-18 21:58:58 -04:00
feat: generic chat_template_kwargs (model config + per-request metadata) (#10359)
* feat(config): add chat_template_kwargs model field + resolver Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier, plus ResolveChatTemplateKwargs which layers the config map under coerced request metadata. Foundation for generic jinja chat-template kwargs (issue #10329). Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): forward resolved chat_template_kwargs blob to backends gRPCPredictOpts now merges per-request client metadata over the server-derived enable_thinking/reasoning_effort (reaching all backends via the standalone keys) and serialises the resolved chat_template_kwargs map into a JSON blob for llama.cpp, written last so a client cannot clobber it. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(http): wire request metadata to config.RequestMetadata The OpenAI request metadata field was parsed but unused; stamp it onto the per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs overrides. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks) Replace the per-key enable_thinking/reasoning_effort handling in both the streaming and non-streaming chat paths with a single block that parses the chat_template_kwargs JSON blob resolved by the Go layer and merges every key into body_json. New jinja template levers (e.g. preserve_thinking) now need no C++ change. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: document custom chat_template_kwargs (model + per-request) Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the received PredictOptions.Metadata, and an app_test.go spec that drives a real /v1/chat/completions request (model chat_template_kwargs + per-request metadata override) and asserts the exact metadata + chat_template_kwargs blob the REST layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(config): grandfather chat_template_kwargs in registry coverage chat_template_kwargs is a free-form map[string]any (like engine_args, already on the list), not a scalar the config UI registry can surface, so it is exempt from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries failure introduced by the new field. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -1922,25 +1922,27 @@ public:
|
||||
body_json["min_p"] = data["min_p"];
|
||||
}
|
||||
|
||||
// Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
|
||||
// Forward the chat_template_kwargs the Go layer resolved (model config
|
||||
// chat_template_kwargs + per-request metadata: enable_thinking,
|
||||
// reasoning_effort, preserve_thinking, ...). One generic merge replaces
|
||||
// the previous per-key handling - new template levers need no C++ change.
|
||||
// oaicompat_chat_params_parse reads these from body_json.
|
||||
const auto& metadata = request->metadata();
|
||||
auto et_it = metadata.find("enable_thinking");
|
||||
if (et_it != metadata.end()) {
|
||||
if (!body_json.contains("chat_template_kwargs")) {
|
||||
body_json["chat_template_kwargs"] = json::object();
|
||||
auto ctk_it = metadata.find("chat_template_kwargs");
|
||||
if (ctk_it != metadata.end() && !ctk_it->second.empty()) {
|
||||
try {
|
||||
json ctk = json::parse(ctk_it->second);
|
||||
if (ctk.is_object()) {
|
||||
if (!body_json.contains("chat_template_kwargs")) {
|
||||
body_json["chat_template_kwargs"] = json::object();
|
||||
}
|
||||
for (auto& el : ctk.items()) {
|
||||
body_json["chat_template_kwargs"][el.key()] = el.value();
|
||||
}
|
||||
}
|
||||
} catch (const std::exception & e) {
|
||||
SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
|
||||
}
|
||||
body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
|
||||
}
|
||||
|
||||
// Pass reasoning_effort via chat_template_kwargs too: the lever
|
||||
// jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
|
||||
// from enable_thinking which those templates ignore.
|
||||
auto re_it = metadata.find("reasoning_effort");
|
||||
if (re_it != metadata.end() && !re_it->second.empty()) {
|
||||
if (!body_json.contains("chat_template_kwargs")) {
|
||||
body_json["chat_template_kwargs"] = json::object();
|
||||
}
|
||||
body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
|
||||
}
|
||||
|
||||
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
|
||||
@@ -2756,25 +2758,26 @@ public:
|
||||
body_json["min_p"] = data["min_p"];
|
||||
}
|
||||
|
||||
// Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
|
||||
// Forward the chat_template_kwargs the Go layer resolved (model config
|
||||
// chat_template_kwargs + per-request metadata: enable_thinking,
|
||||
// reasoning_effort, preserve_thinking, ...). One generic merge replaces
|
||||
// the previous per-key handling - new template levers need no C++ change.
|
||||
const auto& predict_metadata = request->metadata();
|
||||
auto predict_et_it = predict_metadata.find("enable_thinking");
|
||||
if (predict_et_it != predict_metadata.end()) {
|
||||
if (!body_json.contains("chat_template_kwargs")) {
|
||||
body_json["chat_template_kwargs"] = json::object();
|
||||
auto predict_ctk_it = predict_metadata.find("chat_template_kwargs");
|
||||
if (predict_ctk_it != predict_metadata.end() && !predict_ctk_it->second.empty()) {
|
||||
try {
|
||||
json ctk = json::parse(predict_ctk_it->second);
|
||||
if (ctk.is_object()) {
|
||||
if (!body_json.contains("chat_template_kwargs")) {
|
||||
body_json["chat_template_kwargs"] = json::object();
|
||||
}
|
||||
for (auto& el : ctk.items()) {
|
||||
body_json["chat_template_kwargs"][el.key()] = el.value();
|
||||
}
|
||||
}
|
||||
} catch (const std::exception & e) {
|
||||
SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
|
||||
}
|
||||
body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
|
||||
}
|
||||
|
||||
// Pass reasoning_effort via chat_template_kwargs too: the lever
|
||||
// jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
|
||||
// from enable_thinking which those templates ignore.
|
||||
auto predict_re_it = predict_metadata.find("reasoning_effort");
|
||||
if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
|
||||
if (!body_json.contains("chat_template_kwargs")) {
|
||||
body_json["chat_template_kwargs"] = json::object();
|
||||
}
|
||||
body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
|
||||
}
|
||||
|
||||
// Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
|
||||
|
||||
Reference in New Issue
Block a user