chore(turboquant): bump fork to 4d24ad87 and patch ggml-hip for new f16-turbo fattn-vec instances

Bump TURBOQUANT_VERSION from 627ebbc6 to 4d24ad87, which pulls in upstream commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu"). That commit adds three new template instance files under ggml-cuda/template-instances/:

- fattn-vec-instance-f16-turbo2_0.cu
- fattn-vec-instance-f16-turbo3_0.cu
- fattn-vec-instance-f16-turbo4_0.cu

and wires matching FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO{2,3,4}_0) dispatch cases into fattn.cu.

The dispatch cases are compiled into the HIP build (fattn.cu is shared with ggml-hip via hipify), but the fork forgot to mirror the new source files into ggml/src/ggml-hip/CMakeLists.txt. CMake's ROCm branch carries a hand-curated template-instance list (used when GGML_CUDA_FA_ALL_QUANTS is OFF, which is the default), so the HIP build ends up with the extern template declarations but no matching instantiations. As a result, the -gpu-rocm-hipblas-turboquant job failed at link time (~90min into the 3h+ build).

Add patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch, which the existing apply-patches.sh machinery applies to the cloned fork sources after fetch. The patch appends the three new f16-turbo instance files to ggml-hip's source list in the same interleaved order used by ggml-cuda's CMakeLists.txt. Drop this patch once the fork syncs the ROCm list (the build will fail fast if the anchor context goes stale, which is the signal to retire it).

CUDA builds were unaffected (ggml-cuda's CMakeLists.txt was updated upstream); the failure was isolated to HIP.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
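The drift described above (instance files listed for CUDA but not for HIP) could in principle be caught before the long ROCm build starts. The following is a minimal sketch of such a check, not part of the fork or of apply-patches.sh: the helper name `missing_instances`, the regular expression, and the sample list fragments are all assumptions made for illustration, and the real CMakeLists.txt files may format their source lists differently.

```python
import re

# Hypothetical pattern: matches names such as fattn-vec-instance-f16-turbo2_0.cu
INSTANCE_RE = re.compile(r"fattn-vec-instance-[\w-]+\.cu")


def missing_instances(cuda_cmake: str, hip_cmake: str) -> list[str]:
    """Return instance files named in the CUDA list but absent from the HIP list."""
    cuda = set(INSTANCE_RE.findall(cuda_cmake))
    hip = set(INSTANCE_RE.findall(hip_cmake))
    return sorted(cuda - hip)


# Illustrative fragments only; they are not copied from the real build files.
CUDA_SAMPLE = """
    template-instances/fattn-vec-instance-f16-f16.cu
    template-instances/fattn-vec-instance-f16-turbo2_0.cu
    template-instances/fattn-vec-instance-f16-turbo3_0.cu
    template-instances/fattn-vec-instance-f16-turbo4_0.cu
"""

HIP_SAMPLE = """
    ../ggml-cuda/template-instances/fattn-vec-instance-f16-f16.cu
"""

if __name__ == "__main__":
    for name in missing_instances(CUDA_SAMPLE, HIP_SAMPLE):
        print("missing from HIP list:", name)
```

In practice the two arguments would be the contents of the CUDA and HIP CMakeLists.txt files read from the cloned fork; a non-empty result would indicate that the HIP list has fallen behind.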
chore: ⬆️ Update ggml-org/llama.cpp to 5a4cd6741fc33227cdacb329f355ab21f8481de2 (#9479)
2026-05-24 16:51:44 -04:00 · 2026-04-22 07:13:47 +00:00 · 2026-04-22 08:58:19 +02:00 · 2026-04-22 08:22:05 +02:00
2 changed files with 56 additions and 1 deletions
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=cf8b0dbda9ac0eac30ee33f87bc6702ead1c4664
+LLAMA_VERSION?=5a4cd6741fc33227cdacb329f355ab21f8481de2
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -1,4 +1,59 @@
 ---
+- name: "qwen3.6-35b-a3b-claude-4.6-opus-reasoning-distilled"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
+  description: |
+    # 🔥 Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
+
+    A reasoning SFT fine-tune of `Qwen/Qwen3.6-35B-A3B` on chain-of-thought (CoT) distillation mostly sourced from Claude Opus 4.6. The goal is to preserve Qwen3.6's strong agentic coding and reasoning base while nudging the model toward structured Claude Opus-style reasoning traces and more stable long-form problem solving.
+
+    The training path is text-only. The Qwen3.6 base architecture includes a vision encoder, but this fine-tuning run did not train on image or video examples.
+
+      - **Developed by:** @hesamation
+      - **Base model:** `Qwen/Qwen3.6-35B-A3B`
+      - **License:** apache-2.0
+
+    This fine-tuning run is inspired by Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, including the notebook/training workflow style and Claude Opus reasoning-distillation direction.
+
+    [](https://x.com/Hesamation) [](https://discord.gg/vtJykN3t)
+
+    ## Benchmark Results
+
+    The MMLU-Pro pass used 70 total questions per model: `--limit 5` across 14 MMLU-Pro subjects. Treat this as a smoke/comparative check, not a release-quality full benchmark.
+
+    ...
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qwen
+    - reasoning
+  icon: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_35b_a3b_score.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf
+      presence_penalty: 1.5
+      repeat_penalty: 1
+      temperature: 0.7
+      top_k: 20
+      top_p: 0.8
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf
+      sha256: fd3bf7586354890a2710d69357c30fb221a31eecf9f3cd9418257d9289e02765
+      uri: https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/resolve/main/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf
 - name: "qwen3.5-9b-glm5.1-distill-v1"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls: