From b6fed262719d2ffd6dac42c3ec132ff40eabdc13 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 21 May 2026 15:54:38 +0000
Subject: [PATCH] chore(turboquant): retreat pin to 4c1c3ac0 to skip fork GPU
 regression
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI on the prior 2cbfdc62 pin confirmed that our grpc-server.cpp/patch
fix works (tests-turboquant-grpc and all multiarch turboquant builds
passed), but every GPU singlearch turboquant build now fails with a
static-assertion error in the fork's own
ggml/src/ggml-cuda/fattn-mma-f16.cuh. This is a regression introduced
by the May 14 #22880 `HIP: RDNA3 mma FA` refactor (the file grew from
1855 to 2049 lines).

4c1c3ac0 (2026-05-13 22:12 UTC) is the last commit before that refactor
and still has every API piece grpc-server.cpp depends on (DRAFT_SIMPLE
enum, nested common_params_speculative, model_tgt, get_media_marker(),
common_speculative_types_from_names). MTP support landed later (May 16)
and is not exercised by grpc-server.cpp.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 backend/cpp/turboquant/Makefile             | 2 +-
 backend/cpp/turboquant/patch-grpc-server.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/backend/cpp/turboquant/Makefile b/backend/cpp/turboquant/Makefile
index cdfb0489f..062ccda2d 100644
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@
 
 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=2cbfdc62a1a047b01377948dfdede8cb6a744866
+TURBOQUANT_VERSION?=4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
 
 CMAKE_ARGS?=
diff --git a/backend/cpp/turboquant/patch-grpc-server.sh b/backend/cpp/turboquant/patch-grpc-server.sh
index c9555052e..3a61e21c4 100755
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -9,7 +9,7 @@
 # fork and upstream (flat vs nested `common_params_speculative`, missing
 # `get_media_marker()`, `ctx_server.impl->model` vs `model_tgt`, and a
 # LOCALAI_LEGACY_LLAMA_CPP_SPEC compile gate). As of TURBOQUANT_VERSION
-# 2cbfdc62 the fork has rebased past ggml-org/llama.cpp#21962, #22397 and
+# 4c1c3ac0 the fork has rebased past ggml-org/llama.cpp#21962, #22397 and
 # #22838, so the shared grpc-server.cpp compiles unmodified against the fork.
 # Only the fork-specific KV-cache enum entries remain.
 #