diff --git a/backend/cpp/turboquant/Makefile b/backend/cpp/turboquant/Makefile
index cd3d1ab80..776b82ef1 100644
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@
 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=627ebbc6e27727bd4f65422d8aa60b13404993c8
+TURBOQUANT_VERSION?=4d24ad87b8ed2ad160809af41930f1e04b83f234
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
 CMAKE_ARGS?=
diff --git a/backend/cpp/turboquant/patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch b/backend/cpp/turboquant/patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch
new file mode 100644
index 000000000..5225377d0
--- /dev/null
+++ b/backend/cpp/turboquant/patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch
@@ -0,0 +1,47 @@
+From: LocalAI turboquant backend maintainers
+Subject: ggml-hip: add F16-K + TURBO-V fattn-vec template instances
+
+Upstream commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases
+in fattn.cu") added three new template instance files under ggml-cuda/:
+
+ - fattn-vec-instance-f16-turbo2_0.cu
+ - fattn-vec-instance-f16-turbo3_0.cu
+ - fattn-vec-instance-f16-turbo4_0.cu
+
+and registered them in ggml/src/ggml-cuda/CMakeLists.txt. The companion
+dispatch cases FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO{2,3,4}_0)
+were added to ggml/src/ggml-cuda/fattn.cu, which is shared with the HIP
+build path via hipify.
+
+However, ggml/src/ggml-hip/CMakeLists.txt carries its own explicit list of
+template instance sources (used when GGML_CUDA_FA_ALL_QUANTS is OFF, which
+is the default) and was never updated for the new F16-K + TURBO-V combos.
+The HIP build therefore compiles the dispatch cases (which reference
+ggml_cuda_flash_attn_ext_vec_case) without ever compiling
+the matching template instantiations, causing a link-time failure in the
+-gpu-rocm-hipblas-turboquant CI job.
+
+Add the three new template instance files to ggml-hip's list so the HIP
+build links cleanly. Drop this patch once the fork picks up the
+corresponding upstream sync in ggml-hip/CMakeLists.txt.
+
+--- a/ggml/src/ggml-hip/CMakeLists.txt
++++ b/ggml/src/ggml-hip/CMakeLists.txt
+@@ -85,14 +85,17 @@ else()
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo3_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-q8_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo3_0.cu
++ ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo3_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo2_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-q8_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo2_0.cu
++ ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo2_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo2_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo3_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo4_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-q8_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo4_0.cu
++ ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo4_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo3_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo4_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo2_0.cu
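
Reviewer note: the failure described in the patch comes from the ggml-hip source list drifting out of sync with the ggml-cuda one. A drift check along the following lines might catch the next occurrence before the link step does. This is only a sketch, not part of the PR. It assumes both CMakeLists.txt files name their fattn-vec instance sources explicitly (a glob-based list would defeat it), and the helper names `instance_files` and `missing_in_hip` are hypothetical.

```python
import re

# Matches instance file names such as "fattn-vec-instance-f16-turbo3_0.cu",
# with or without a leading directory like "../ggml-cuda/template-instances/".
INSTANCE_RE = re.compile(r"fattn-vec-instance-[\w-]+\.cu")


def instance_files(cmake_text):
    """Return the set of fattn-vec instance file names mentioned in the text."""
    return set(INSTANCE_RE.findall(cmake_text))


def missing_in_hip(cuda_text, hip_text):
    """Return instance files named in the CUDA list but absent from the HIP list."""
    return sorted(instance_files(cuda_text) - instance_files(hip_text))


if __name__ == "__main__":
    # Illustrative excerpts only; a real check would read the two
    # CMakeLists.txt files from the checked-out fork instead.
    cuda_excerpt = """
        template-instances/fattn-vec-instance-q8_0-turbo3_0.cu
        template-instances/fattn-vec-instance-f16-turbo3_0.cu
    """
    hip_excerpt = """
        ../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo3_0.cu
    """
    for name in missing_in_hip(cuda_excerpt, hip_excerpt):
        print("missing from ggml-hip list:", name)
```

Comparing bare file names rather than full paths sidesteps the `../ggml-cuda/` prefix that the HIP list uses. Run against the real files, an empty result would suggest the patch can be dropped.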