Mirror of https://github.com/mudler/LocalAI.git, synced 2026-05-24 08:38:02 -04:00
chore(turboquant): retreat pin to 4c1c3ac0 to skip fork GPU regression
CI on the prior 2cbfdc62 pin confirmed our grpc-server.cpp/patch fix works (tests-turboquant-grpc + all multiarch turboquant builds passed), but every GPU singlearch turboquant build now hits a static-assertion error in the fork's own ggml/src/ggml-cuda/fattn-mma-f16.cuh, a regression introduced by the May 14 #22880 `HIP: RDNA3 mma FA` refactor (file went from 1855 to 2049 lines).

4c1c3ac0 (2026-05-13 22:12 UTC) is the last commit before that refactor and still has every API piece grpc-server.cpp depends on (DRAFT_SIMPLE enum, nested common_params_speculative, model_tgt, get_media_marker(), common_speculative_types_from_names). MTP support landed later (May 16) and is not exercised by grpc-server.cpp.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -1,7 +1,7 @@
 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=2cbfdc62a1a047b01377948dfdede8cb6a744866
+TURBOQUANT_VERSION?=4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=

@@ -9,7 +9,7 @@
 # fork and upstream (flat vs nested `common_params_speculative`, missing
 # `get_media_marker()`, `ctx_server.impl->model` vs `model_tgt`, and a
 # LOCALAI_LEGACY_LLAMA_CPP_SPEC compile gate). As of TURBOQUANT_VERSION
-# 2cbfdc62 the fork has rebased past ggml-org/llama.cpp#21962, #22397 and
+# 4c1c3ac0 the fork has rebased past ggml-org/llama.cpp#21962, #22397 and
 # #22838, so the shared grpc-server.cpp compiles unmodified against the fork.
 # Only the fork-specific KV-cache enum entries remain.
 #
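
Because the pin is declared with `?=`, make assigns it only when the variable is not already set, so a different fork commit can be tried locally without editing the file. The sketch below illustrates that behaviour with a throwaway Makefile. The temporary directory and the `show` target are hypothetical stand-ins invented for this example; they are not part of LocalAI's build, and the real target names should be taken from the repository.

```shell
#!/bin/sh
# Minimal sketch of `?=` semantics, using a scratch Makefile (an assumption,
# not LocalAI's actual Makefile). The `show` target is hypothetical.
set -e

tmp=$(mktemp -d)

# Write a two-line Makefile: the pinned default plus a target that prints it.
printf 'TURBOQUANT_VERSION?=4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c\nshow:\n\t@echo $(TURBOQUANT_VERSION)\n' > "$tmp/Makefile"

# With nothing set in the environment, the `?=` default applies.
default_pin=$(env -u TURBOQUANT_VERSION make -s -f "$tmp/Makefile" show)

# With the variable already set in the environment, `?=` leaves it alone,
# so the build would use the previous pin instead.
override_pin=$(TURBOQUANT_VERSION=2cbfdc62a1a047b01377948dfdede8cb6a744866 make -s -f "$tmp/Makefile" show)

echo "default:  $default_pin"
echo "override: $override_pin"

rm -rf "$tmp"
```

The same override also works when the variable is passed on the make command line. Either form is useful for checking whether a later fork commit still trips the static assertion before the pin is moved forward again.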