ci(turboquant): drop the ROCm/hipblas build flavor

The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate), its
llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep ggml-on-ROCm
kernel issue, not something a small fork patch can paper over.

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)
2026-06-07 00:06:51 -04:00 · 2026-06-06 22:43:47 +00:00 · 2026-06-06 22:03:48 +00:00 · 2026-06-06 20:39:10 +00:00 · 2026-06-06 20:39:10 +00:00 · 2026-06-06 13:53:10 +02:00
14 changed files with 321 additions and 165 deletions
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -1766,20 +1766,6 @@ include:
    dockerfile: "./backend/Dockerfile.llama-cpp"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'hipblas'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-rocm-hipblas-turboquant'
-    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-rocm-amd64'
-    runs-on: 'ubuntu-latest'
-    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-    skip-drivers: 'false'
-    backend: "turboquant"
-    dockerfile: "./backend/Dockerfile.turboquant"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=1520eda980564241434b791ce2bbbd128c4be9ea
+IK_LLAMA_VERSION?=6b9de3dbaa21ae95ea80638e5ee836795cc48c93
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -482,23 +482,13 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    if (!request->draftmodel().empty()) {
        params.speculative.draft.mparams.path = request->draftmodel();
        // Default to draft type if a draft model is set but no explicit type.
-        // Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
-        // vector; the turboquant fork still uses the legacy scalar. The
-        // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
-        // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
-        // Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
-        // in ggml-org/llama.cpp#22964; the fork still uses the old name.
-#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
-        if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
-            params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
-        }
-#else
+        // Upstream made the speculative type a vector (ggml-org/llama.cpp#22838)
+        // and renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE (#22964).
        const bool no_spec_type = params.speculative.types.empty() ||
            (params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
        if (no_spec_type) {
            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
        }
-#endif
    }

    //  params.model_alias ??
@@ -574,9 +564,10 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    // tokens (0 disables the minimum). Match upstream's default (256). This
    // field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
    // also shifted from a fixed cadence to a minimum spacing. The turboquant
-    // fork branched before the field existed, so skip it on the legacy path
-    // (LOCALAI_LEGACY_LLAMA_CPP_SPEC is injected by patch-grpc-server.sh).
-#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+    // fork still lacks common_params::checkpoint_min_step, so skip it there
+    // (LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP is injected by
+    // backend/cpp/turboquant/patch-grpc-server.sh).
+#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
    params.checkpoint_min_step = 256;
 #endif

@@ -752,7 +743,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                params.cache_idle_slots = false;
            }

-#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
        // --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
        // 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
        // `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
@@ -906,17 +897,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt

        // Speculative decoding options
        } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
-#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
-            // Fork only knows a single scalar `type`. Take the first comma-
-            // separated value and assign it via the singular helper.
-            std::string first = optval_str;
-            const auto comma = first.find(',');
-            if (comma != std::string::npos) first = first.substr(0, comma);
-            auto type = common_speculative_type_from_name(first);
-            if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
-                params.speculative.type = type;
-            }
-#else
            // Upstream switched to a vector of types (comma-separated for multi-type
            // chaining via common_speculative_types_from_names). We keep accepting a
            // single value here, but also tolerate comma-separated lists.
@@ -945,7 +925,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            if (!parsed.empty()) {
                params.speculative.types = parsed;
            }
-#endif
        } else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
            if (optval != NULL) {
                try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -983,21 +962,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            // shares the target context size. Accept the option for backward
            // compatibility but silently ignore it.

-// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
-// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
-// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
-// fields. The turboquant fork branched before that, so its build defines
-// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
-// keys become unrecognized (silently dropped, like any unknown opt) for it.
-//
-// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
-// closing-brace position of the `draft_ctx_size` branch on purpose: in the
-// legacy build the chain ends here (the brace closes draft_ctx_size), and in
-// the modern build the chain continues with `} else if (...)` instead, so the
-// brace count stays balanced under both branches of the preprocessor.
-#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
-        }
-#else
        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
            if (optval != NULL) {
@@ -1127,7 +1091,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            }
            if (!cur.empty()) flush(cur);
        }
-#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
    }

    // Set params.n_parallel from environment variable if not set via options (fallback)
@@ -1177,15 +1140,11 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            params.tensor_buft_overrides.push_back({nullptr, nullptr});
        }
    }
-    // The draft tensor_buft_overrides are only populated under the modern
-    // (post-#22838) layout, whose population code is itself gated by
-    // LOCALAI_LEGACY_LLAMA_CPP_SPEC above. The turboquant fork lacks
-    // common_params_speculative::draft entirely, so skip the sentinel there too.
-#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
+    // the main-model handling above.
    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
    }
-#endif

    // TODO: Add yarn

--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
+TURBOQUANT_VERSION?=7d9715f1f071fa07c7b2ad3dbfd320b314139e65
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -4,21 +4,19 @@
 #
 #   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
 #      fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
-#   2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
-#      server-side random per-instance marker) with the legacy "<__media__>"
-#      literal. The fork branched before that PR, so server-common.cpp has no
-#      get_media_marker symbol. The fork's mtmd_default_marker() still returns
-#      "<__media__>", and Go-side tooling falls back to that sentinel when the
-#      backend does not expose media_marker, so substituting the literal keeps
-#      behavior identical on the turboquant path.
-#   3. Revert the `common_params_speculative` field references to the
-#      pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
-#      struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
-#      the turboquant fork branched before that PR and still exposes the flat
-#      `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
-#      below map the new nested paths back to the legacy flat names so the
-#      shared grpc-server.cpp keeps compiling against the fork's common.h.
-#      Drop this block once the fork rebases past #22397.
+#   2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file
+#      so the grpc-server option parser skips the two references to
+#      common_params::checkpoint_min_step (the default and the option handler).
+#      That field does not exist in the fork yet; drop this once it does.
+#
+# The fork used to lag upstream on the whole common_params_speculative refactor
+# (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename (#22838) and
+# get_media_marker (#21962), which required a much larger compat shim here
+# (flat-field sed renames + a coarse LOCALAI_LEGACY_LLAMA_CPP_SPEC define). The
+# fork has since rebased past all of those, so the only remaining gap is
+# checkpoint_min_step. If a future bump reintroduces a divergence, add a narrow
+# guard in grpc-server.cpp keyed on a fork-specific macro and inject it here
+# rather than resurrecting the coarse one.
 #
 # We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
 # under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
@@ -72,72 +70,20 @@ else
    echo "==> KV allow-list patch OK"
 fi

-if grep -q 'get_media_marker()' "$SRC"; then
-    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
-    # Only one call site today (ModelMetadata), but replace all occurrences to
-    # stay robust if upstream adds more. Use a temp file to avoid relying on
-    # sed -i portability (the builder image uses GNU sed, but keeping this
-    # consistent with the awk block above).
-    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-    echo "==> get_media_marker() substitution OK"
+# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file so
+#    the grpc-server option parser skips the two references to
+#    common_params::checkpoint_min_step (the default assignment and the option
+#    handler). That field does not exist in the fork yet. Drop this block once
+#    the fork rebases past the bump that added checkpoint_min_step.
+if grep -q '^#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP' "$SRC"; then
+    echo "==> $SRC already defines LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, skipping"
 else
-    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
-fi
-
-if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
-    echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
-    # Each substitution is the exact post-refactor path → legacy flat field.
-    # Order doesn't matter because the source paths are disjoint, but we keep
-    # the most-specific (mparams.path) first for readability.
-    sed -E \
-        -e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
-        -e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
-        -e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
-        -e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
-        -e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
-        -e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
-        -e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
-        -e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
-        -e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
-        -e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
-        "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-    echo "==> speculative field rename OK"
-else
-    echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
-fi
-
-# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
-#    ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
-#    exposes the field as `model` on `server_context_impl`. The two call sites
-#    are in the Rerank and ModelMetadata RPC handlers.
-if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
-    echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
-    sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-    echo "==> model_tgt rename OK"
-else
-    echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
-fi
-
-# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
-#    grpc-server option parser skips the new option-handler blocks (ngram_mod,
-#    ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
-#    draft.tensor_buft_overrides) introduced for the post-#22838 layout, the
-#    draft.tensor_buft_overrides sentinel termination, and the
-#    common_params::checkpoint_min_step default/option (added with the
-#    35c9b1f3 bump). Those blocks reference struct fields that simply do not
-#    exist in the fork.
-if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
-    echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
-else
-    echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
-    # Insert the define before the very first `#include` so it precedes all the
-    # speculative-decoding code paths.
+    echo "==> patching $SRC to define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top"
+    # Insert the define before the very first `#include` so it precedes the
+    # checkpoint_min_step references.
    awk '
        !done && /^#include/ {
-            print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
+            print "#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP 1"
            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
            print ""
            done = 1
@@ -145,13 +91,13 @@ else
        { print }
        END {
            if (!done) {
-                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
+                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP" > "/dev/stderr"
                exit 1
            }
        }
    ' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
-    echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
+    echo "==> LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP define OK"
 fi

 echo "==> all patches applied"
--- a/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
+++ b/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
@@ -0,0 +1,55 @@
+hip: port the turboquant CUDA additions that ggml's HIP shim doesn't cover
+
+The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs
+that ggml's HIP (and MUSA) compatibility layer does not provide, breaking
+the -gpu-rocm-hipblas-turboquant build:
+
+  1. ggml_cuda_copy2d_across_devices() (host-staged cross-device copy for
+     split mul_mat output) uses the CUDA 3D-peer copy APIs
+     cudaMemcpy3DPeerParms / make_cudaPitchedPtr / make_cudaExtent /
+     cudaMemcpy3DPeerAsync. HIP genuinely does not support these (see the
+     fork's own comment "HIP does not support cudaMemcpy3DPeerAsync"), so
+     guard the peer fast path with #if !defined(GGML_USE_HIP) &&
+     !defined(GGML_USE_MUSA) -- matching how the fork already guards the
+     same API for the sibling 2D copy -- and fall through to the existing
+     cudaMemcpyAsync staging fallback below (functionally identical,
+     slightly slower on multi-GPU ROCm).
+
+  2. ggml_backend_cuda_device_event_new() creates its event with plain
+     cudaEventCreate, which ggml's HIP shim does not alias (it only aliases
+     cudaEventCreateWithFlags). Use cudaEventCreateWithFlags(..., 
+     cudaEventDisableTiming) -- exactly what the rest of this file already
+     does (cf. lines ~1034, ~3461) and HIP-safe.
+
+CUDA builds are unaffected. Drop the relevant hunk once the fork HIP-ports
+these; apply-patches.sh fails fast if an anchor goes stale.
+
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index 0427e6b..6352e6a 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -1933,6 +1933,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
+     size_t width, size_t height, cudaStream_t dst_stream, cudaStream_t src_stream) {
+ 
+     const auto & info = ggml_cuda_info();
++#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)  // 3D-peer copy types unmapped by ggml's HIP/MUSA shim; use staging fallback below
+     if (info.peer_access[src_device][dst_device]) {
+         cudaMemcpy3DPeerParms p = {};
+         p.dstDevice = dst_device;
+@@ -1942,6 +1943,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
+         p.extent = make_cudaExtent(width, height, 1);
+         return cudaMemcpy3DPeerAsync(&p, dst_stream);
+     }
++#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
+ 
+     // Fallback: stage all rows through a single contiguous pinned buffer
+     int prev_device = ggml_cuda_get_device();
+@@ -5714,7 +5716,7 @@ static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_
+     ggml_cuda_set_device(dev_ctx->device);
+ 
+     cudaEvent_t event;
+-    CUDA_CHECK(cudaEventCreate(&event));
++    CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
+ 
+     return new ggml_backend_event {
+         /* .device  = */ dev,
--- a/backend/go/parakeet-cpp/Makefile
+++ b/backend/go/parakeet-cpp/Makefile
@@ -1,6 +1,6 @@
 # parakeet-cpp backend Makefile.
 #
-# Upstream pin lives below as PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
+# Upstream pin lives below as PARAKEET_VERSION?=50dfc24b4faa4ee23a1f59401f1d0c87fc4042b0
 # (.github/bump_deps.sh) can find and update it - matches the
 # whisper.cpp / ds4 / vibevoice-cpp convention.
 #
@@ -15,7 +15,7 @@
 # That's what the L0 smoke test uses. The default target below does the
 # proper clone-at-pin + cmake build so CI doesn't need a side-checkout.

-PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
+PARAKEET_VERSION?=50dfc24b4faa4ee23a1f59401f1d0c87fc4042b0
 PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp

 GOCMD?=go
--- a/backend/go/parakeet-cpp/batcher.go
+++ b/backend/go/parakeet-cpp/batcher.go
@@ -7,8 +7,12 @@ import "time"
 type batchRequest struct {
 	pcm     []float32
 	decoder int32
-	tag     string
-	reply   chan batchReply
+	// language is the per-request target locale ("" means the model default).
+	// parakeet.cpp's batched C-API takes ONE target_lang for the whole batch,
+	// so the dispatcher only coalesces requests that share a language.
+	language string
+	tag      string
+	reply    chan batchReply
 }

 // batchReply carries one per-item JSON object string (an element of the C-API's
@@ -43,13 +47,25 @@ func newBatcher(maxSize int, maxWait time.Duration, runBatch func([]*batchReques
 // run is the dispatcher loop: accumulate submitted requests until either maxSize
 // is reached or maxWait elapses since the first queued request, then dispatch.
 // Exits when stop is closed (draining any partially-filled batch first).
+//
+// A batch carries ONE language (parakeet.cpp's batched C-API takes a single
+// target_lang), so a request whose language differs from the batch leader is
+// not coalesced: it is held in carry and becomes the leader of the next batch.
+// carry is therefore never dropped and its caller never deadlocks: every batch
+// (including a lone carry on stop) is dispatched, and runBatch replies to all.
 func (b *batcher) run(stop <-chan struct{}) {
+	var carry *batchRequest
 	for {
 		var first *batchRequest
-		select {
-		case first = <-b.submit:
-		case <-stop:
-			return
+		if carry != nil {
+			// A mismatched request from the previous fill leads this batch.
+			first, carry = carry, nil
+		} else {
+			select {
+			case first = <-b.submit:
+			case <-stop:
+				return
+			}
 		}
 		batch := []*batchRequest{first}

@@ -64,12 +80,22 @@ func (b *batcher) run(stop <-chan struct{}) {
 		for len(batch) < b.maxSize {
 			select {
 			case r := <-b.submit:
+				if r.language != first.language {
+					// Different language: carry it to the next batch so this
+					// batch stays single-language, then dispatch what we have.
+					carry = r
+					break fill
+				}
 				batch = append(batch, r)
 			case <-timer.C:
 				break fill
 			case <-stop:
 				timer.Stop()
 				b.runBatch(batch)
+				// Don't strand a carried request's caller on shutdown.
+				if carry != nil {
+					b.runBatch([]*batchRequest{carry})
+				}
 				return
 			}
 		}
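
Review note: the carry rule in the batcher.go hunk above can be checked in isolation with a synchronous sketch. This is not the dispatcher itself: it drops the channels, timer, and stop handling and only models how a mismatched request closes the current batch and leads the next one. The `request` type and the `coalesce` function are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// request is a hypothetical stand-in for batchRequest, reduced to the
// fields the grouping rule needs.
type request struct {
	tag      string
	language string
}

// coalesce walks reqs in arrival order. It closes the current batch when the
// batch reaches maxSize or when the next request's language differs from the
// batch leader's. A mismatched request is carried over and leads the next
// batch, so no request is dropped.
func coalesce(reqs []request, maxSize int) [][]request {
	var batches [][]request
	var carry *request
	i := 0
	for carry != nil || i < len(reqs) {
		var first request
		if carry != nil {
			// A mismatched request from the previous fill leads this batch.
			first, carry = *carry, nil
		} else {
			first = reqs[i]
			i++
		}
		batch := []request{first}
		for len(batch) < maxSize && i < len(reqs) {
			r := reqs[i]
			i++
			if r.language != first.language {
				carry = &r
				break
			}
			batch = append(batch, r)
		}
		batches = append(batches, batch)
	}
	return batches
}

func main() {
	reqs := []request{
		{"a", "en"}, {"b", "en"}, {"c", "de"}, {"d", "de"},
		{"e", "en"}, {"f", "fr"}, {"g", "fr"},
	}
	for _, b := range coalesce(reqs, 16) {
		fmt.Println(b[0].language, len(b))
	}
}
```

With the same language sequence the new test uses, the sketch yields four batches, each single-language, and every request lands in exactly one of them. The real dispatcher can split batches differently because the timer and goroutine scheduling also decide where a batch ends.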
--- a/backend/go/parakeet-cpp/batcher_test.go
+++ b/backend/go/parakeet-cpp/batcher_test.go
@@ -105,4 +105,60 @@ var _ = Describe("batcher", func() {
 		go func() { <-rep }()
 		Eventually(dispatched, "2s").Should(Receive(Equal(1)))
 	})
+
+	It("never coalesces requests with different languages into one batch", func() {
+		// parakeet.cpp's batched C-API takes ONE target_lang per batch, so the
+		// dispatcher must keep every dispatched batch single-language. Submit a
+		// mix of languages and assert (a) no batch ever carries more than one
+		// distinct language and (b) every submitted request still gets a reply
+		// (the mismatched carry-over is never dropped).
+		var mu sync.Mutex
+		var langsPerBatch [][]string
+		run := func(reqs []*batchRequest) {
+			seen := map[string]struct{}{}
+			var distinct []string
+			for _, r := range reqs {
+				if _, ok := seen[r.language]; !ok {
+					seen[r.language] = struct{}{}
+					distinct = append(distinct, r.language)
+				}
+			}
+			mu.Lock()
+			langsPerBatch = append(langsPerBatch, distinct)
+			mu.Unlock()
+			echoReply(reqs)
+		}
+		// Large window + size so the fill loop stays open across submits and the
+		// language constraint (not the timer) is what splits the batches.
+		b := newBatcher(16, 200*time.Millisecond, run)
+		stop := make(chan struct{})
+		go b.run(stop)
+		defer close(stop)
+
+		langs := []string{"en", "en", "de", "de", "en", "fr", "fr"}
+		const N = 7
+		var wg sync.WaitGroup
+		got := make([]string, N)
+		for i := 0; i < N; i++ {
+			wg.Add(1)
+			go func(i int) {
+				defer wg.Done()
+				rep := make(chan batchReply, 1)
+				b.submit <- &batchRequest{tag: string(rune('a' + i)), language: langs[i], reply: rep}
+				got[i] = (<-rep).json
+			}(i)
+		}
+		wg.Wait()
+
+		mu.Lock()
+		defer mu.Unlock()
+		// Invariant: every dispatched batch is single-language.
+		for _, distinct := range langsPerBatch {
+			Expect(len(distinct)).To(Equal(1), "a batch coalesced more than one language: %v", distinct)
+		}
+		// Liveness: every request got a reply (carry-over never stranded).
+		for i := 0; i < N; i++ {
+			Expect(got[i]).To(Equal(string(rune('a' + i))))
+		}
+	})
 })
--- a/backend/go/parakeet-cpp/goparakeetcpp.go
+++ b/backend/go/parakeet-cpp/goparakeetcpp.go
@@ -48,6 +48,13 @@ var (
 	// side reads them as const float*/const int*.
 	CppTranscribePcmBatchJSON func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32) uintptr

+	// CppTranscribePcmBatchJSONLang is the multilingual variant of the batched
+	// JSON entry point: identical, plus a trailing target_lang. "" (the model
+	// default, "auto") is passed for non-prompt models, which ignore it; an
+	// unknown locale on a prompt model returns 0 and sets last_error. Present
+	// only in newer libparakeet.so; nil falls back to CppTranscribePcmBatchJSON.
+	CppTranscribePcmBatchJSONLang func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32, targetLang string) uintptr
+
 	// Cache-aware streaming (RNN-T) entry points. stream_begin returns 0 for
 	// non-streaming models. feed/finalize return a malloc'd char* (uintptr,
 	// freed via CppFreeString); feed writes 1 to *eouOut on an <EOU>/<EOB>.
@@ -55,6 +62,11 @@ var (
 	CppStreamFeed     func(s uintptr, pcm []float32, nSamples int32, eouOut unsafe.Pointer) uintptr
 	CppStreamFinalize func(s uintptr) uintptr
 	CppStreamFree     func(s uintptr)
+
+	// CppStreamBeginLang is the multilingual variant of stream_begin: identical,
+	// plus a trailing target_lang ("" means the model default). Present only in
+	// newer libparakeet.so; nil falls back to CppStreamBegin.
+	CppStreamBeginLang func(ctx uintptr, targetLang string) uintptr
 )

 // streamChunkSamples is how much 16 kHz mono PCM we hand to stream_feed per
@@ -187,8 +199,19 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
 	if len(reqs) > 0 {
 		dec = reqs[0].decoder
 	}
+	// All requests in a batch share one language (the batcher coalesces only
+	// same-language requests), so any element's language describes the batch.
+	lang := ""
+	if len(reqs) > 0 {
+		lang = reqs[0].language
+	}
 	p.engineMu.Lock()
-	cstr := CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
+	var cstr uintptr
+	if CppTranscribePcmBatchJSONLang != nil {
+		cstr = CppTranscribePcmBatchJSONLang(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec, lang)
+	} else {
+		cstr = CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
+	}
 	p.engineMu.Unlock()
 	if cstr == 0 {
 		err := fmt.Errorf("parakeet-cpp: batch transcribe failed: %s", CppLastError(p.ctxPtr))
@@ -226,8 +249,9 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
 // OpenAI API, whose default is segment-level); token ids always populate
 // Segment.Tokens.
 //
-// translate/diarize/prompt/temperature/language/threads are not applicable to
-// parakeet and are ignored; streaming is handled by AudioTranscriptionStream
+// translate/diarize/prompt/temperature/threads are not applicable to parakeet
+// and are ignored; language is honored on the batched + streaming paths (see
+// opts.GetLanguage() below); streaming is handled by AudioTranscriptionStream
 // (L2).
 func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
 	if p.ctxPtr == 0 {
@@ -271,7 +295,7 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
 	}
 	rep := make(chan batchReply, 1)
 	select {
-	case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, reply: rep}:
+	case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, language: opts.GetLanguage(), reply: rep}:
 	case <-ctx.Done():
 		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
 	}
@@ -361,7 +385,12 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
 		return status.Error(codes.Canceled, "transcription cancelled")
 	}

-	stream := CppStreamBegin(p.ctxPtr)
+	var stream uintptr
+	if CppStreamBeginLang != nil {
+		stream = CppStreamBeginLang(p.ctxPtr, opts.GetLanguage())
+	} else {
+		stream = CppStreamBegin(p.ctxPtr)
+	}
 	if stream == 0 {
 		// Not a cache-aware streaming model: run a normal offline
 		// transcription and emit it as one delta + a closing final result.
--- a/backend/go/parakeet-cpp/main.go
+++ b/backend/go/parakeet-cpp/main.go
@@ -65,6 +65,17 @@ func main() {
 		purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
 	}

+	// Per-request language variants (multilingual nemotron). Same probe pattern:
+	// present only in libparakeet.so built with multilingual support, so the
+	// backend still loads against an older library and falls back to the
+	// non-lang batched + streaming entry points (model default / "auto").
+	if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json_lang"); err == nil && sym != 0 {
+		purego.RegisterLibFunc(&CppTranscribePcmBatchJSONLang, lib, "parakeet_capi_transcribe_pcm_batch_json_lang")
+	}
+	if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_begin_lang"); err == nil && sym != 0 {
+		purego.RegisterLibFunc(&CppStreamBeginLang, lib, "parakeet_capi_stream_begin_lang")
+	}
+
 	fmt.Fprintf(os.Stderr, "[parakeet-cpp] ABI=%d\n", CppAbiVersion())

 	flag.Parse()
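
Review note: the probe in main.go and the nil checks in goparakeetcpp.go together form an optional-symbol fallback. Below is a minimal sketch of that pattern with plain function variables instead of purego registration. `streamBegin`, `streamBeginLang`, and `beginStream` are hypothetical names, and the stub bodies in `main` stand in for the C entry points.

```go
package main

import "fmt"

// Hypothetical stand-ins for the registered entry points. In the backend
// these are filled in by purego.RegisterLibFunc; here they are plain
// function variables so the fallback logic can run on its own.
var (
	streamBegin     func(ctx uintptr) uintptr
	streamBeginLang func(ctx uintptr, lang string) uintptr // stays nil when the symbol is absent
)

// beginStream prefers the language-aware entry point and falls back to the
// older one when the loaded library does not export it.
func beginStream(ctx uintptr, lang string) uintptr {
	if streamBeginLang != nil {
		return streamBeginLang(ctx, lang)
	}
	return streamBegin(ctx)
}

func main() {
	// Older library: only the non-lang symbol is present.
	streamBegin = func(ctx uintptr) uintptr { return 1 }
	fmt.Println(beginStream(0, "de"))

	// Newer library: the lang variant was found and registered.
	streamBeginLang = func(ctx uintptr, lang string) uintptr { return 2 }
	fmt.Println(beginStream(0, "de"))
}
```

Keeping the choice in one helper means call sites never test for nil themselves. On the fallback path the requested language is silently ignored, which matches the "model default" behaviour described in the comments above.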
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -1,4 +1,57 @@
 ---
+- name: "gemma-4-12b-it-qat-q4_0"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf
+  description: |
+    Hugging Face |
+    GitHub |
+    Launch Blog |
+    Documentation
+
+    License: Apache 2.0 | Authors: Google DeepMind
+
+    > [!Note]
+    > This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model.
+    > Four versions of the QAT checkpoints are available:
+    > * **Unquantized QAT checkpoints** (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.
+    > * **GGUF** (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.
+    > * **Mobile-optimized** (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.
+    > * **Compressed Tensors** (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B
+
+    ...
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-12B-it-qat-q4_0-gguf/gemma-4-12b-it-qat-q4_0.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-12B-it-qat-q4_0-gguf/gemma-4-12b-it-qat-q4_0.gguf
+      sha256: faff1a63667fac17ac5e777f47114688fcefea96e220e211aaa8d62c2c4561f1
+      uri: https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/gemma-4-12b-it-qat-q4_0.gguf
+    - filename: llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+      sha256: e70b0e5cd80323d5d588b4ed06780356b7b1ba03995a4b8164c6ae9db0ff5989
+      uri: https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/mmproj-gemma-4-12b-it-qat-q4_0.gguf
 - name: "step-3.7-flash"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
@@ -31887,6 +31940,41 @@
    - filename: parakeet-cpp/tdt_ctc-1.1b-f16.gguf
      uri: huggingface://mudler/parakeet-cpp-gguf/tdt_ctc-1.1b-f16.gguf
      sha256: cd53f64eefac2623a12f2f118ef50b56622dc3012f42c815c6adf0d08292f387
+- name: parakeet-cpp-nemotron-3.5-asr-streaming-0.6b
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/mudler/parakeet-cpp-gguf
+    - https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
+    - https://github.com/mudler/parakeet.cpp
+  description: |
+    Multilingual (40+ locales), prompt-conditioned, cache-aware streaming FastConformer RNN-T, 0.6B.
+    Q8_0 GGUF for the parakeet-cpp backend (C++/ggml port of NVIDIA NeMo). Byte-identical to NeMo at
+    WER 0 offline and streaming, about 2.5x faster than NeMo on CPU with no GPU. Select a language with
+    the request "language" field (for example en, de, es, ja-JP), or leave it empty for automatic
+    detection. License OpenMDW-1.1.
+  license: other
+  tags:
+    - parakeet
+    - parakeet-cpp
+    - nemotron
+    - asr
+    - speech-recognition
+    - stt
+    - multilingual
+    - streaming
+    - gguf
+    - ggml
+  overrides:
+    backend: parakeet-cpp
+    known_usecases:
+      - transcript
+    name: parakeet-cpp-nemotron-3.5-asr-streaming-0.6b
+    parameters:
+      model: parakeet-cpp/nemotron-3.5-asr-streaming-0.6b-q8_0.gguf
+  files:
+    - filename: parakeet-cpp/nemotron-3.5-asr-streaming-0.6b-q8_0.gguf
+      uri: huggingface://mudler/parakeet-cpp-gguf/nemotron-3.5-asr-streaming-0.6b-q8_0.gguf
+      sha256: ba2f13eccd4a5245be728f77e6149bd6a4fdcdd133ff2e08ac6005bcef7a99f1
 - name: parakeet-crispasr
  url: github:mudler/LocalAI/gallery/virtual.yaml@master
  urls:
--- a/go.mod
+++ b/go.mod
@@ -219,8 +219,8 @@ require (
 	github.com/kevinburke/ssh_config v1.2.0 // indirect
 	github.com/labstack/gommon v0.4.2 // indirect
 	github.com/mschoch/smat v0.2.0 // indirect
-	github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87
-	github.com/mudler/localrecall v0.6.1-0.20260507074622-a7724fef6f81 // indirect
+	github.com/mudler/LocalAGI v0.0.0-20260606071251-14aed1ae4336
+	github.com/mudler/localrecall v0.6.3-0.20260606070048-9a3b3321a9cd // indirect
 	github.com/mudler/skillserver v0.0.7-0.20260520220837-a7317cbf9145
 	github.com/olekukonko/tablewriter v0.0.5 // indirect
 	github.com/oxffaa/gopher-parse-sitemap v0.0.0-20191021113419-005d2eb1def4 // indirect
--- a/go.sum
+++ b/go.sum
@@ -966,8 +966,8 @@ github.com/mr-tron/base58 v1.3.0 h1:K6Y13R2h+dku0wOqKtecgRnBUBPrZzLZy5aIj8lCcJI=
 github.com/mr-tron/base58 v1.3.0/go.mod h1:2BuubE67DCSWwVfx37JWNG8emOC0sHEU4/HpcYgCLX8=
 github.com/mschoch/smat v0.2.0 h1:8imxQsjDm8yFEAVBe7azKmKSgzSkZXDuKkSq9374khM=
 github.com/mschoch/smat v0.2.0/go.mod h1:kc9mz7DoBKqDyiRL7VZN8KvXQMWeTaVnttLRXOlotKw=
-github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87 h1:az+2umaD/sT1rRvI3WZHWXjzdJVJHxcyxp0SNYbqlFk=
-github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87/go.mod h1:x77p9W1zKZr+W+UcEwg8/qdp00p4XXOI69wE7WlXZc0=
+github.com/mudler/LocalAGI v0.0.0-20260606071251-14aed1ae4336 h1:iKBkSnpisOvMVxFoYsAObvAuOqXBakRPMD0PWxWG5EE=
+github.com/mudler/LocalAGI v0.0.0-20260606071251-14aed1ae4336/go.mod h1:U+g6u8mF2wQxhkdBl3dr8G4db1cv3n7KTKmraoJ7D0c=
 github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b h1:A74T2Lauvg61KodYqsjTYDY05kPLcW+efVZjd23dghU=
 github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b/go.mod h1:6sfja3lcu2nWRzEc0wwqGNu/eCG3EWgij+8s7xyUeQ4=
 github.com/mudler/edgevpn v0.34.0 h1:qDrD/rCPFY/FdURbXudIZWihVKY4VOX3nMn3CcbeQEU=
@@ -976,8 +976,8 @@ github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc h1:RxwneJl1VgvikiX
 github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc/go.mod h1:O7SwdSWMilAWhBZMK9N9Y/oBDyMMzshE3ju8Xkexwig=
 github.com/mudler/go-processmanager v0.1.1 h1:c/1NRZOZpW8HuFv9RhBG57nQu1oDMRomEHedwBFMlrw=
 github.com/mudler/go-processmanager v0.1.1/go.mod h1:h6kmHUZeafr+k5hRYpGLMzJFH4hItHffgpRo2QIkP+o=
-github.com/mudler/localrecall v0.6.1-0.20260507074622-a7724fef6f81 h1:8D9NJ/ikhsJCxUwbdzIzadw6RqDrW+L0FPqpQQSeux8=
-github.com/mudler/localrecall v0.6.1-0.20260507074622-a7724fef6f81/go.mod h1:28k5n19raUrkuwXkacdNsBlj8yuSnGhpT16tu+2+4dU=
+github.com/mudler/localrecall v0.6.3-0.20260606070048-9a3b3321a9cd h1:trn9D5UHAE6zdRyD2uX04W1tLSslAwozVwcyNTd72Ak=
+github.com/mudler/localrecall v0.6.3-0.20260606070048-9a3b3321a9cd/go.mod h1:28k5n19raUrkuwXkacdNsBlj8yuSnGhpT16tu+2+4dU=
 github.com/mudler/memory v0.0.0-20260406210934-424c1ecf2cf8 h1:Ry8RiWy8fZ6Ff4E7dPmjRsBrnHOnPeOOj2LhCgyjQu0=
 github.com/mudler/memory v0.0.0-20260406210934-424c1ecf2cf8/go.mod h1:EA8Ashhd56o32qN7ouPKFSRUs/Z+LrRCF4v6R2Oarm8=
 github.com/mudler/skillserver v0.0.7-0.20260520220837-a7317cbf9145 h1:z59tA3IDYPt71nzH1jpxeaA1LuDw8aZfpTQFNU43Zb8=
Author	SHA1	Message	Date
Ettore Di Giacinto	df86e8d6d4	ci(turboquant): drop the ROCm/hipblas build flavor The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin: beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate), its llama.cpp base fails to compile the flash-attention MMA f16 kernels for head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero / non-constant static asserts in fattn-mma-f16.cuh). That is a deep ggml-on-ROCm kernel issue, not something a small fork patch can paper over. Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-06 22:43:47 +00:00
Ettore Di Giacinto	67ff7de374	fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate) The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by apply-patches.sh) ports them: - Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks cudaMemcpy3DPeerAsync, per the fork's own comment). - Create the device event in ggml_backend_cuda_device_event_new with the HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the un-aliased plain cudaEventCreate, matching this file's own usage elsewhere. CUDA builds are unaffected. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-06 22:03:48 +00:00
Ettore Di Giacinto	d11a152ad3	fix(turboquant): drop obsolete legacy-spec shim after fork rebased The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the upstream common_params_speculative refactor (ggml-org/llama.cpp #22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker (#21962). The old fork-compat shim forced now-wrong legacy code paths, breaking the build with errors like 'struct common_params_speculative has no member named mparams_dft / type' and 'server_context_impl has no member named model'. Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared grpc-server.cpp (stock llama-cpp and the modern fork both take the modern path now), and narrow the one remaining gap (the fork still lacks common_params::checkpoint_min_step) to a dedicated LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by patch-grpc-server.sh. The patch script now only adds the turbo2/3/4 KV-cache types and injects that one macro. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-06 20:39:10 +00:00
Ettore Di Giacinto	3cdd6a8e63	chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-06 20:39:10 +00:00
LocalAI [bot]	03c84cff28	feat(parakeet-cpp): nemotron-3.5-asr multilingual streaming model + request language support (#10199) * feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths Reads opts.GetLanguage() and threads it through to the new parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang C-API entry points, both probed with Dlsym so the backend still loads against an older libparakeet.so (falling back to the non-lang paths, i.e. model default). parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so the dispatcher only coalesces same-language requests: a request whose language differs from the batch leader is held as a single carry-over and becomes the leader of the next batch, never dropped and never left waiting (including on shutdown). A new batcher test asserts no dispatched batch is ever mixed-language and that every submitted request still receives a reply. Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0 default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp commit that ships nemotron support plus batched causal subsampling and the batched target_lang C-API. Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 13:53:10 +02:00
LocalAI [bot]	9bc69c9e5f	chore(model gallery): 🤖 add 1 new models via gallery agent (#10200) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-06 13:52:46 +02:00
LocalAI [bot]	1e6c9cfd60	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `6b9de3dbaa21ae95ea80638e5ee836795cc48c93` (#10190) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-06 09:42:43 +02:00
LocalAI [bot]	0e6712f734	chore: ⬆️ Update mudler/parakeet.cpp to `843600590f96a31467a5199f827c253f34c110f7` (#10198) chore(parakeet-cpp): bump pin to banded long-audio attention (843600590) Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f (merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer attention with the chunk-matmul construction: long audio now uses O(T*window) attention instead of global O(T^2), fixing the encoder OOM on long clips (~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window. Short clips are unchanged (global path). No C-ABI change. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 09:25:25 +02:00
LocalAI [bot]	0e4cee9a97	chore: bump LocalAGI + localrecall (fix pgvector hybrid search seqscan, #10186) (#10192) chore: bump LocalAGI and localrecall (index-backed RRF hybrid search) Bumps the agent stack to pull in the PostgreSQL hybrid-search fix: - mudler/localrecall -> v0.6.3-...-9a3b3321a9cd (mudler/LocalRecall#46, merged) - mudler/LocalAGI -> ...-14aed1ae4336 (mudler/LocalAGI#477, merged) localrecall's hybrid search previously sorted on a wrapped scalar similarity expression, which blinded the planner into a full sequential scan over every row and exceeded the statement timeout on large collections, returning an empty result set. It now uses the canonical Reciprocal Rank Fusion pattern (index-backed candidate retrieval + FULL OUTER JOIN + weighted RRF). Fixes #10186 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 09:16:59 +02:00
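
The last commit above mentions weighted Reciprocal Rank Fusion for hybrid search. As a rough illustration of the scoring step only, here is a minimal in-memory sketch in Go. It is not localrecall's implementation, which per the commit message does this in PostgreSQL with index-backed candidate retrieval and a FULL OUTER JOIN. The names `Fused` and `FuseRRF`, the sample document IDs, and the constant `k = 60` (a commonly used default) are assumptions made for this example. Each input list is assumed to be ordered best first with no duplicate IDs.

```go
package main

import (
	"fmt"
	"sort"
)

// Fused is one document with its combined RRF score (hypothetical type).
type Fused struct {
	ID    string
	Score float64
}

// FuseRRF merges ranked ID lists (best first) with weighted Reciprocal Rank Fusion:
//
//	score(d) = sum over lists i of weights[i] / (k + rank_i(d)), ranks starting at 1.
//
// A document missing from a list simply adds nothing for that list, which is the
// role a FULL OUTER JOIN with a zero default plays in a SQL formulation.
// A list without a matching weight defaults to weight 1.
func FuseRRF(lists [][]string, weights []float64, k float64) []Fused {
	scores := make(map[string]float64)
	for i, list := range lists {
		w := 1.0
		if i < len(weights) {
			w = weights[i]
		}
		for pos, id := range list {
			scores[id] += w / (k + float64(pos+1))
		}
	}

	out := make([]Fused, 0, len(scores))
	for id, s := range scores {
		out = append(out, Fused{ID: id, Score: s})
	}
	// Highest score first; ties broken by ID so the order is deterministic.
	sort.Slice(out, func(a, b int) bool {
		if out[a].Score != out[b].Score {
			return out[a].Score > out[b].Score
		}
		return out[a].ID < out[b].ID
	})
	return out
}

func main() {
	// Candidate lists as they might come back from two index-backed queries.
	vector := []string{"doc-a", "doc-b", "doc-c"}
	keyword := []string{"doc-b", "doc-d", "doc-a"}

	for _, f := range FuseRRF([][]string{vector, keyword}, []float64{1, 1}, 60) {
		fmt.Printf("%s %.5f\n", f.ID, f.Score)
	}
}
```

Because the fused score depends only on each candidate's rank within its own list, every input list can be produced by an ordinary top-N query that an index can serve. That is the property the commit message relies on to avoid the sequential scan.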