fix(llama-cpp): terminate tensor/kv override vectors after passthrough

The tensor_buft_overrides padding and the kv/draft override terminators ran before the generic option passthrough, so a passthrough flag (--cpu-moe, --override-tensor, --override-kv, ...) appended a real entry after the null sentinel. For tensor_buft_overrides this tripped the model loader's back().pattern == nullptr assertion (crash); for kv_overrides the entry was silently dropped.

Move all three termination/padding blocks to the end of params_parse, after both the named-option loop and common_params_parse have pushed their real entries.

Also widen the exit()-flag skip list so --version, --license, --list-devices and --cache-list cannot terminate the backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
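
The ordering problem described above can be illustrated with a small standalone sketch. It does not use the llama.cpp types: `Override`, `pad_with_sentinels`, `ends_with_sentinel`, `visible_entries` and `build` are hypothetical names invented for illustration, and the loader check is reduced to "the last entry must have a null pattern".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one override entry. A null pattern marks the
// sentinel, as in the back().pattern == nullptr check quoted above.
struct Override {
    const char * pattern;
    const char * buffer_type;
};

// Pad with null entries up to `slots` (a no-op if the vector is already that long).
static void pad_with_sentinels(std::vector<Override> & v, std::size_t slots) {
    while (v.size() < slots) {
        v.push_back({nullptr, nullptr});
    }
}

// Simplified loader precondition: the last entry must be the sentinel.
static bool ends_with_sentinel(const std::vector<Override> & v) {
    return !v.empty() && v.back().pattern == nullptr;
}

// Number of entries a consumer sees if it stops at the first null pattern.
static std::size_t visible_entries(const std::vector<Override> & v) {
    std::size_t n = 0;
    while (n < v.size() && v[n].pattern != nullptr) {
        ++n;
    }
    return n;
}

// Build the vector with padding applied either before the passthrough entry
// (the old order) or after it (the order this commit moves to).
static std::vector<Override> build(bool pad_before_passthrough, std::size_t slots) {
    assert(slots >= 3); // room for two real entries plus at least one sentinel
    std::vector<Override> v;
    v.push_back({"named-option-pattern", "CPU"});   // pushed by the named-option loop
    if (pad_before_passthrough) {
        pad_with_sentinels(v, slots);
    }
    v.push_back({"passthrough-pattern", "CPU"});    // pushed by the passthrough parse
    if (!pad_before_passthrough) {
        pad_with_sentinels(v, slots);
    }
    return v;
}
```

Calling `build(true, 4)` gives a vector whose last entry is a real pattern and whose passthrough entry sits behind a null, so a sentinel check fails and a first-null scan sees only one entry. `build(false, 4)` ends in a sentinel with both real entries visible.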
docs(llama-cpp): document cpu_moe/n_cpu_moe and option passthrough
2026-06-25 00:59:28 -04:00 · 2026-06-24 17:28:22 +00:00 · 2026-06-24 17:13:13 +00:00 · 2026-06-24 17:10:18 +00:00 · 2026-06-24 17:07:14 +00:00
38 changed files with 269 additions and 1116 deletions
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -4974,9 +4974,6 @@ includeDarwin:
  - backend: "kitten-tts"
    tag-suffix: "-metal-darwin-arm64-kitten-tts"
    build-type: "mps"
-  - backend: "liquid-audio"
-    tag-suffix: "-metal-darwin-arm64-liquid-audio"
-    build-type: "mps"
  - backend: "piper"
    tag-suffix: "-metal-darwin-arm64-piper"
    build-type: "metal"
@@ -4993,10 +4990,6 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-sherpa-onnx"
    build-type: "metal"
    lang: "go"
-  - backend: "supertonic"
-    tag-suffix: "-metal-darwin-arm64-supertonic"
-    build-type: "metal"
-    lang: "go"
  - backend: "local-store"
    tag-suffix: "-metal-darwin-arm64-local-store"
    build-type: "metal"
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=d5507e33ae7ee2b7b41475f08044d3bde3b839ee
+IK_LLAMA_VERSION?=7ccf1d209588962b96eacca325b37e9b3e8faf5e
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -37,6 +37,7 @@
 #include "backend.pb.h"
 #include "backend.grpc.pb.h"
 #include "common.h"
+#include "arg.h"
 #include "chat-auto-parser.h"
 #include <getopt.h>
 #include <grpcpp/ext/proto_server_reflection_plugin.h>
@@ -592,6 +593,10 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    params.checkpoint_min_step = 256;
 #endif

+    // Raw upstream llama-server flags collected from any option entry that
+    // starts with '-'. Applied once after the loop via common_params_parse.
+    std::vector<std::string> extra_argv;
+
     // decode options. Options are in form optname:optvale, or if booleans only optname.
    for (int i = 0; i < request->options_size(); i++) {
        std::string opt = request->options(i);
@@ -1080,6 +1085,31 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                } catch (...) {}
            }

+        // --- main model MoE on CPU (upstream --cpu-moe / --n-cpu-moe) ---
+        } else if (!strcmp(optname, "cpu_moe")) {
+            // Bool-style flag: keep all MoE expert weights on CPU.
+            const bool enable = (optval == NULL) ||
+                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
+                optval_str == "on" || optval_str == "enabled";
+            if (enable) {
+                params.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
+            }
+        } else if (!strcmp(optname, "n_cpu_moe")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n < 0) n = 0;
+                    // Keep override-name storage alive for the lifetime of the
+                    // params struct (mirrors upstream arg.cpp's function-local static).
+                    static std::list<std::string> buft_overrides_main;
+                    for (int i = 0; i < n; ++i) {
+                        buft_overrides_main.push_back(llm_ffn_exps_block_regex(i));
+                        params.tensor_buft_overrides.push_back(
+                            {buft_overrides_main.back().c_str(), ggml_backend_cpu_buffer_type()});
+                    }
+                } catch (...) {}
+            }
+
        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
            // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
@@ -1111,6 +1141,30 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                else { cur.push_back(c); }
            }
            if (!cur.empty()) flush(cur);
+
+        // --- generic passthrough: any entry starting with '-' is a raw
+        //     upstream llama-server flag, forwarded verbatim to the parser. ---
+        } else if (optname[0] == '-') {
+            std::string flag = optname;
+            // These flags make upstream's parser exit() (printing usage /
+            // completion), which would kill the backend process. Skip them.
+            if (flag == "-h" || flag == "--help" || flag == "--usage" ||
+                flag == "--version" || flag == "--license" ||
+                flag == "--list-devices" || flag == "-cl" ||
+                flag == "--cache-list" ||
+                flag.rfind("--completion", 0) == 0) {
+                fprintf(stderr,
+                    "[llama-cpp] ignoring passthrough flag that would exit: %s\n",
+                    flag.c_str());
+            } else {
+                extra_argv.push_back(flag);
+                // Preserve the whole value after the first ':' so embedded
+                // colons (e.g. host:port) survive strtok's truncation of optval.
+                auto colon = opt.find(':');
+                if (colon != std::string::npos) {
+                    extra_argv.push_back(opt.substr(colon + 1));
+                }
+            }
        }
    }

@@ -1146,27 +1200,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        }
    }

-    if (!params.kv_overrides.empty()) {
-        params.kv_overrides.emplace_back();
-        params.kv_overrides.back().key[0] = 0;
-    }
-
-    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
-    // Real entries are pushed during option parsing; here we pad/terminate so the
-    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
-    // and so llama_params_fit has the placeholder slots it requires.
-    {
-        const size_t ntbo = llama_max_tensor_buft_overrides();
-        while (params.tensor_buft_overrides.size() < ntbo) {
-            params.tensor_buft_overrides.push_back({nullptr, nullptr});
-        }
-    }
-    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
-    // the main-model handling above.
-    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
-        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
-    }
-
    // TODO: Add yarn

    if (!request->tensorsplit().empty()) {
@@ -1259,6 +1292,69 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            params.sampling.grammar_triggers.push_back(std::move(trigger));
        }
    }
+
+    // Apply any raw upstream flags last so an explicit passthrough flag wins
+    // over the LocalAI-resolved field it maps to (e.g. --ctx-size beats
+    // context_size). This is the same parser llama-server itself uses.
+    if (!extra_argv.empty()) {
+        // common_params_parser_init resets a few fields for the SERVER example
+        // (n_parallel -> -1, use_color). Snapshot n_parallel so an unrelated
+        // passthrough flag can't silently clobber LocalAI's resolved value.
+        const int saved_n_parallel = params.n_parallel;
+
+        std::vector<char *> argv;
+        std::string prog = "llama-server";
+        argv.push_back(prog.data());
+        for (auto & a : extra_argv) {
+            argv.push_back(a.data());
+        }
+
+        // ctx_arg.params is a reference, so this overlays the given flags onto
+        // `params` in place. Returns false on a recoverable parse error (and
+        // self-restores params); may exit() on a hard error, exactly as
+        // passing the same bad flag to llama-server would.
+        if (!common_params_parse((int)argv.size(), argv.data(), params,
+                                 LLAMA_EXAMPLE_SERVER)) {
+            fprintf(stderr,
+                "[llama-cpp] failed to parse passthrough options; ignoring them\n");
+        }
+
+        // Restore n_parallel unless a passthrough flag explicitly set it
+        // (parser_init's reset sentinel for SERVER is -1).
+        if (params.n_parallel == -1) {
+            params.n_parallel = saved_n_parallel;
+        }
+    }
+
+    // Terminate/pad the override vectors only after BOTH the named-option loop
+    // and the generic passthrough (common_params_parse above) have pushed their
+    // real entries, so back() is the null sentinel the model loader asserts on.
+    // Running these before the passthrough let a passthrough flag (--cpu-moe,
+    // --override-tensor, --override-kv, ...) append a real entry after the
+    // sentinel: a GGML_ASSERT crash for tensor_buft_overrides, a silent drop for
+    // kv_overrides. Double-termination is harmless (the while is a no-op if the
+    // passthrough parse already padded; an extra trailing null is ignored).
+
+    if (!params.kv_overrides.empty()) {
+        params.kv_overrides.emplace_back();
+        params.kv_overrides.back().key[0] = 0;
+    }
+
+    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
+    // Real entries are pushed during option parsing; here we pad/terminate so the
+    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
+    // and so llama_params_fit has the placeholder slots it requires.
+    {
+        const size_t ntbo = llama_max_tensor_buft_overrides();
+        while (params.tensor_buft_overrides.size() < ntbo) {
+            params.tensor_buft_overrides.push_back({nullptr, nullptr});
+        }
+    }
+    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
+    // the main-model handling above.
+    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
+        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
+    }
 }


--- a/backend/go/omnivoice-cpp/Makefile
+++ b/backend/go/omnivoice-cpp/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # omnivoice.cpp version
 OMNIVOICE_REPO?=https://github.com/ServeurpersoCom/omnivoice.cpp
-OMNIVOICE_VERSION?=0f37401bebe9b20c0160a888e592108fc1d17607
+OMNIVOICE_VERSION?=96d30169afd5e6bb3fd6a0e9be0eb505bfe81fcd
 SO_TARGET?=libgomnivoicecpp.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/supertonic/helper.go
+++ b/backend/go/supertonic/helper.go
@@ -16,7 +16,6 @@ import (
 	"os"
 	"path/filepath"
 	"regexp"
-	"runtime"
 	"strings"
 	"time"
 	"unicode"
@@ -944,13 +943,7 @@ func InitializeONNXRuntime() error {
 			}
 		}
 		if libPath == "" {
-			// LocalAI: default to the platform-native shared library
-			// extension when nothing else is found (dyld vs ld.so).
-			if runtime.GOOS == "darwin" {
-				libPath = "/usr/local/lib/libonnxruntime.dylib"
-			} else {
-				libPath = "/usr/local/lib/libonnxruntime.so"
-			}
+			libPath = "/usr/local/lib/libonnxruntime.so"
 		}
 	}
 	ort.SetSharedLibraryPath(libPath)
--- a/backend/go/supertonic/package.sh
+++ b/backend/go/supertonic/package.sh
@@ -32,10 +32,6 @@ elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ $(uname -s) = "Darwin" ]; then
-    # macOS: dyld resolves the bundled .dylib via DYLD_LIBRARY_PATH (set in
-    # run.sh); there is no ld.so loader nor glibc to bundle.
-    echo "Detected Darwin"
 else
    echo "Error: Could not detect architecture"
    exit 1
--- a/backend/go/supertonic/run.sh
+++ b/backend/go/supertonic/run.sh
@@ -3,19 +3,12 @@ set -ex

 CURDIR=$(dirname "$(realpath $0)")

-if [ "$(uname)" = "Darwin" ]; then
-	# macOS uses dyld: there is no ld.so loader, and the search path env
-	# var is DYLD_LIBRARY_PATH. ONNX Runtime ships as a .dylib here.
-	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
-	export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.dylib
-else
-	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
-	export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.so
+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+export ONNXRUNTIME_LIB_PATH=$CURDIR/lib/libonnxruntime.so

-	if [ -f $CURDIR/lib/ld.so ]; then
-		echo "Using lib/ld.so"
-		exec $CURDIR/lib/ld.so $CURDIR/supertonic "$@"
-	fi
+if [ -f $CURDIR/lib/ld.so ]; then
+	echo "Using lib/ld.so"
+	exec $CURDIR/lib/ld.so $CURDIR/supertonic "$@"
 fi

 exec $CURDIR/supertonic "$@"
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -1284,7 +1284,6 @@
    nvidia-cuda-13: "cuda13-liquid-audio"
    nvidia-cuda-12: "cuda12-liquid-audio"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio"
-    metal: "metal-liquid-audio"
  icon: https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png
 - &qwen-tts
  urls:
@@ -1570,7 +1569,6 @@
    - TTS
  capabilities:
    default: "cpu-supertonic"
-    metal: "metal-supertonic"
 - !!merge <<: *neutts
  name: "neutts-development"
  capabilities:
@@ -4614,7 +4612,6 @@
    nvidia-cuda-13: "cuda13-liquid-audio-development"
    nvidia-cuda-12: "cuda12-liquid-audio-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
-    metal: "metal-liquid-audio-development"
 - !!merge <<: *liquid-audio
  name: "cpu-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-liquid-audio"
@@ -4625,16 +4622,6 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-liquid-audio"
  mirrors:
    - localai/localai-backends:master-cpu-liquid-audio
- !!merge <<: *liquid-audio
-  name: "metal-liquid-audio"
-  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-liquid-audio"
-  mirrors:
-    - localai/localai-backends:latest-metal-darwin-arm64-liquid-audio
- !!merge <<: *liquid-audio
-  name: "metal-liquid-audio-development"
-  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-liquid-audio"
-  mirrors:
-    - localai/localai-backends:master-metal-darwin-arm64-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cuda12-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-liquid-audio"
@@ -5497,7 +5484,6 @@
  name: "supertonic-development"
  capabilities:
    default: "cpu-supertonic-development"
-    metal: "metal-supertonic-development"
 - !!merge <<: *supertonic
  name: "cpu-supertonic"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-supertonic"
@@ -5508,13 +5494,3 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-supertonic"
  mirrors:
    - localai/localai-backends:master-cpu-supertonic
- !!merge <<: *supertonic
-  name: "metal-supertonic"
-  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-supertonic"
-  mirrors:
-    - localai/localai-backends:latest-metal-darwin-arm64-supertonic
- !!merge <<: *supertonic
-  name: "metal-supertonic-development"
-  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-supertonic"
-  mirrors:
-    - localai/localai-backends:master-metal-darwin-arm64-supertonic
--- a/backend/python/liquid-audio/install.sh
+++ b/backend/python/liquid-audio/install.sh
@@ -14,11 +14,5 @@ else
 fi

 # liquid-audio's torch wheels are large; allow upgrades to satisfy transitive pins
-EXTRA_PIP_INSTALL_FLAGS+=" --upgrade"
-# --index-strategy is a uv-only flag. The darwin/MPS build installs with pip
-# (USE_PIP=true in scripts/build/python-darwin.sh), which rejects it. Only add
-# it on the uv path; Linux/CUDA resolution is unchanged.
-if [ "x${USE_PIP:-}" != "xtrue" ]; then
-    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-first-match"
-fi
+EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
 installRequirements
--- a/backend/python/liquid-audio/requirements-mps.txt
+++ b/backend/python/liquid-audio/requirements-mps.txt
@@ -1,4 +1,3 @@
-# MPS (Apple Silicon / Metal) build profile - installed by the darwin CI job.
 torch>=2.8.0
 torchaudio>=2.8.0
 torchcodec>=0.9.1
--- a/core/config/hardware_defaults.go
+++ b/core/config/hardware_defaults.go
@@ -54,35 +54,8 @@ func (g GPU) IsNVIDIABlackwell() bool {
 	return maj >= 12
 }

-// Compute-buffer headroom guard for the raised physical batch.
-//
-// Raising n_ubatch grows the CUDA *compute buffer* (the scratch for the forward
-// graph), which is allocated PER DEVICE — it does not benefit from a second GPU
-// the way weights or KV (which are split across devices) do. The buffer scales
-// ~linearly with n_ubatch * n_ctx, so a large context turns the GB10-tuned
-// ub2048 into multi-GiB of extra scratch that must fit on a SINGLE card. On a
-// 16 GiB consumer Blackwell with a 200k context that overflows (issue #10485),
-// even though the GB10 it was measured on (128 GiB unified memory) had room.
-//
-// These constants size a conservative guard: only raise the batch when the
-// extra scratch fits the per-device VRAM ceiling.
-const (
-	// computeBufferBytesPerCell approximates the CUDA compute-buffer cost of one
-	// (n_ubatch * n_ctx) cell. Derived from an observed allocation (ub2048 *
-	// ctx204800 ~= 4.5 GiB => ~11 B/cell) and rounded up to 16 for margin, since
-	// the real cost also grows with model width (heads / embedding dim) which we
-	// don't know at config time.
-	computeBufferBytesPerCell = 16
-	// blackwellBatchHeadroomDivisor caps the extra compute buffer from raising the
-	// physical batch at VRAM/divisor. /4 keeps the bulk of a device for weights +
-	// KV, which already dominate VRAM use.
-	blackwellBatchHeadroomDivisor = 4
-)
-
 // PhysicalBatch returns the canonical physical batch (n_batch/n_ubatch) for the
-// given hardware class, ignoring context/VRAM headroom. Use
-// PhysicalBatchForContext when a model context and per-device VRAM are known
-// (the load paths) so the raised batch can't overflow a single device.
+// given hardware, used when the model config leaves batch unset.
 func PhysicalBatch(g GPU) int {
 	if g.IsNVIDIABlackwell() {
 		return BlackwellPhysicalBatch
@@ -90,32 +63,6 @@ func PhysicalBatch(g GPU) int {
 	return DefaultPhysicalBatch
 }

-// PhysicalBatchForContext is PhysicalBatch gated on per-device VRAM headroom for
-// the given context: it only raises the batch above the conservative default
-// when the extra compute buffer (which is allocated on a single device and grows
-// with n_ubatch * n_ctx) fits within blackwellBatchHeadroomDivisor of the GPU's
-// VRAM. g.VRAM must be the PER-DEVICE ceiling (the smallest device on a
-// multi-GPU host), not the summed total — the compute buffer can't be split.
-//
-// VRAM 0 (unknown) stays conservative rather than risk a per-device OOM; the
-// GB10 / unified-memory path reports system RAM, so it still clears the guard.
-func PhysicalBatchForContext(g GPU, ctx int) int {
-	if !g.IsNVIDIABlackwell() {
-		return DefaultPhysicalBatch
-	}
-	if ctx <= 0 {
-		ctx = DefaultContextSize
-	}
-	if g.VRAM == 0 {
-		return DefaultPhysicalBatch
-	}
-	extra := uint64(ctx) * uint64(BlackwellPhysicalBatch-DefaultPhysicalBatch) * computeBufferBytesPerCell
-	if extra <= g.VRAM/blackwellBatchHeadroomDivisor {
-		return BlackwellPhysicalBatch
-	}
-	return DefaultPhysicalBatch
-}
-
 // IsManagedPhysicalBatch reports whether n is a value PhysicalBatch assigns.
 // Callers that re-tune a value chosen by an upstream host (the distributed
 // router correcting the frontend's guess) use this to avoid clobbering an
@@ -175,12 +122,7 @@ func hasParallelOption(opts []string) bool {
 // deterministic device — detection does a live nvidia-smi call.
 var localGPU = func() GPU {
 	vendor, _ := xsysinfo.DetectGPUVendor()
-	// Use the SMALLEST device's VRAM, not the summed total: the parallel-slot
-	// tier and the batch headroom guard both reason about what fits on a single
-	// card, and per-device compute buffers can't be split across GPUs. Summing
-	// two 16 GiB cards into "32 GiB" is what over-provisioned multi-GPU hosts
-	// into OOM (issue #10485).
-	vram, _ := xsysinfo.MinPerGPUVRAM()
+	vram, _ := xsysinfo.TotalAvailableVRAM()
 	return GPU{
 		Vendor:            vendor,
 		ComputeCapability: xsysinfo.NVIDIAComputeCapability(),
@@ -195,20 +137,10 @@ func ApplyHardwareDefaults(cfg *ModelConfig, gpu GPU) {
 	if cfg == nil {
 		return
 	}
-	// Raise the physical batch on Blackwell only when the resulting compute
-	// buffer fits the per-device VRAM at THIS model's context. Leaving Batch at 0
-	// (rather than writing the default 512) preserves the downstream single-pass
-	// sizing in core/backend.EffectiveBatchSize for embedding/score/rerank.
-	if cfg.Batch == 0 {
-		ctx := DefaultContextSize
-		if cfg.ContextSize != nil {
-			ctx = *cfg.ContextSize
-		}
-		if PhysicalBatchForContext(gpu, ctx) == BlackwellPhysicalBatch {
-			cfg.Batch = BlackwellPhysicalBatch
-			xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
-				"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability, "context", ctx, "vram_gib", gpu.VRAM>>30)
-		}
+	if cfg.Batch == 0 && gpu.IsNVIDIABlackwell() {
+		cfg.Batch = BlackwellPhysicalBatch
+		xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
+			"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability)
 	}

 	// Enable concurrent serving by default on a capable GPU: without this the
--- a/core/config/hardware_defaults_internal_test.go
+++ b/core/config/hardware_defaults_internal_test.go
@@ -9,37 +9,26 @@ import (
 // GPU. The detection seam (localGPU) is injected so the path is deterministic
 // without a real GPU.
 var _ = Describe("SetDefaults hardware defaults (single-instance)", func() {
-	const gib = uint64(1) << 30
-
 	var orig func() GPU
 	BeforeEach(func() { orig = localGPU })
 	AfterEach(func() { localGPU = orig })

-	It("sets the physical batch on a local Blackwell GPU with headroom", func() {
-		localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
+	It("sets the physical batch on a local Blackwell GPU", func() {
+		localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
 		cfg := &ModelConfig{}
 		cfg.SetDefaults()
 		Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
 	})

-	It("leaves batch unset when a large context would overflow the device", func() {
-		// Regression guard for issue #10485: 16 GiB consumer Blackwell + ~200k ctx.
-		localGPU = func() GPU { return GPU{ComputeCapability: "12.0", VRAM: 16 * gib} }
-		ctx := 204800
-		cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
-		cfg.SetDefaults()
-		Expect(cfg.Batch).To(Equal(0))
-	})
-
 	It("leaves batch unset on a non-Blackwell local GPU", func() {
-		localGPU = func() GPU { return GPU{ComputeCapability: "8.9", VRAM: 119 * gib} }
+		localGPU = func() GPU { return GPU{ComputeCapability: "8.9"} }
 		cfg := &ModelConfig{}
 		cfg.SetDefaults()
 		Expect(cfg.Batch).To(Equal(0))
 	})

 	It("never overrides an explicit batch", func() {
-		localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
+		localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
 		cfg := &ModelConfig{}
 		cfg.Batch = 1024
 		cfg.SetDefaults()
--- a/core/config/hardware_defaults_test.go
+++ b/core/config/hardware_defaults_test.go
@@ -7,8 +7,6 @@ import (
 )

 var _ = Describe("Hardware-driven config defaults", func() {
-	const gib = uint64(1) << 30
-
 	DescribeTable("GPU.IsNVIDIABlackwell (sm_12x consumer family)",
 		func(cc string, want bool) {
 			Expect(GPU{ComputeCapability: cc}.IsNVIDIABlackwell()).To(Equal(want))
@@ -37,54 +35,21 @@ var _ = Describe("Hardware-driven config defaults", func() {
 		})
 	})

-	Describe("PhysicalBatchForContext (per-device VRAM headroom)", func() {
-		It("raises the batch when the compute buffer fits the device", func() {
-			// 16 GiB Blackwell with a small context: the extra scratch is tiny.
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 8192)).
-				To(Equal(BlackwellPhysicalBatch))
-		})
-		It("keeps the default batch when a large context would overflow one device", func() {
-			// The issue #10485 case: 16 GiB consumer Blackwell, ~200k context.
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 204800)).
-				To(Equal(DefaultPhysicalBatch))
-		})
-		It("still raises the batch on a large unified-memory device (GB10)", func() {
-			// GB10 reports system RAM (~119 GiB) as its single device's VRAM.
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1", VRAM: 119 * gib}, 204800)).
-				To(Equal(BlackwellPhysicalBatch))
-		})
-		It("stays conservative when VRAM is unknown", func() {
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1"}, 8192)).
-				To(Equal(DefaultPhysicalBatch))
-		})
-		It("never raises the batch on non-Blackwell", func() {
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "9.0", VRAM: 80 * gib}, 8192)).
-				To(Equal(DefaultPhysicalBatch))
-		})
-	})
-
 	Describe("ApplyHardwareDefaults", func() {
-		It("raises an unset batch to 2048 on Blackwell with headroom", func() {
+		It("raises an unset batch to 2048 on Blackwell", func() {
 			cfg := &ModelConfig{}
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
 			Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
 		})
-		It("leaves batch unset when a large context would overflow one device", func() {
-			// Regression guard for issue #10485: 16 GiB card + ~200k context.
-			ctx := 204800
-			cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.0", VRAM: 16 * gib})
-			Expect(cfg.Batch).To(Equal(0))
-		})
 		It("leaves batch unset on non-Blackwell", func() {
 			cfg := &ModelConfig{}
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0", VRAM: 119 * gib})
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0"})
 			Expect(cfg.Batch).To(Equal(0))
 		})
 		It("never overrides an explicit batch", func() {
 			cfg := &ModelConfig{}
 			cfg.Batch = 1024
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
 			Expect(cfg.Batch).To(Equal(1024))
 		})
 		It("no-ops on nil", func() {
@@ -92,6 +57,8 @@ var _ = Describe("Hardware-driven config defaults", func() {
 		})
 	})

+	const gib = uint64(1) << 30
+
 	DescribeTable("DefaultParallelSlots (by VRAM)",
 		func(vramGiB uint64, want int) {
 			Expect(DefaultParallelSlots(GPU{VRAM: vramGiB * gib})).To(Equal(want))
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -1204,6 +1204,11 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
 	// This ensures gallery-installed and runtime-loaded models get optimal parameters.
 	ApplyInferenceDefaults(cfg, cfg.Name, cfg.Model)

+	// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell).
+	// Uses the local GPU here; in distributed mode the router re-applies the same
+	// heuristics for the selected node's GPU before loading. Explicit config wins.
+	ApplyHardwareDefaults(cfg, localGPU())
+
 	// Apply serving-policy defaults (device-independent): cross-request prefix
 	// caching. Propagates to distributed nodes via the model options.
 	ApplyServingDefaults(cfg)
@@ -1242,16 +1247,6 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
 		cfg.ContextSize = &ctx
 	}
 	runBackendHooks(cfg, lo.modelPath)
-
-	// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell)
-	// LAST, after the context size is fully resolved (explicit config, LoadOptions,
-	// then the GGUF guess inside runBackendHooks): the Blackwell batch guard sizes
-	// the per-device compute buffer against this model's context, so it must see
-	// the final value, not a pre-guess nil. Uses the local GPU here; in distributed
-	// mode the router re-applies the same heuristics for the selected node's GPU
-	// before loading. Explicit config always wins.
-	ApplyHardwareDefaults(cfg, localGPU())
-
 	cfg.syncKnownUsecasesFromString()
 }

--- a/core/http/endpoints/openai/realtime_model.go
+++ b/core/http/endpoints/openai/realtime_model.go
@@ -432,7 +432,7 @@ func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigL
 	if pipeline.SoundDetection == "" {
 		return nil, nil
 	}
-	cfg, err := loadPipelineSubModel(cl, pipeline.SoundDetection, ml.ModelPath)
+	cfg, err := cl.LoadModelConfigFileByName(pipeline.SoundDetection, ml.ModelPath)
 	if err != nil {
 		return nil, fmt.Errorf("failed to load sound detection config: %w", err)
 	}
@@ -443,7 +443,7 @@ func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigL
 }

 func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
-	cfgVAD, err := loadPipelineSubModel(cl, pipeline.VAD, ml.ModelPath)
+	cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
 	if err != nil {

 		return nil, nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -453,7 +453,7 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
 		return nil, nil, fmt.Errorf("failed to validate config: %w", err)
 	}

-	cfgSST, err := loadPipelineSubModel(cl, pipeline.Transcription, ml.ModelPath)
+	cfgSST, err := cl.LoadModelConfigFileByName(pipeline.Transcription, ml.ModelPath)
 	if err != nil {

 		return nil, nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -542,30 +542,11 @@ func buildRealtimeRoutingContext(a *application.Application, sessionID string) *
 	}
 }

-// loadPipelineSubModel loads a pipeline sub-model config by name and follows a
-// single alias hop, so a pipeline that references an alias (e.g. `llm: default`)
-// gets the alias target's full config (Backend, Model, ...) rather than the
-// alias stub with an empty Backend. Without this the alias survives unresolved
-// into model loading and fails downstream — notably in distributed mode with
-// "backend name is empty". Mirrors the top-level alias resolution in
-// core/http/middleware/request.go.
-func loadPipelineSubModel(cl *config.ModelConfigLoader, name, modelPath string) (*config.ModelConfig, error) {
-	cfg, err := cl.LoadModelConfigFileByName(name, modelPath)
-	if err != nil {
-		return nil, err
-	}
-	resolved, _, err := cl.ResolveAlias(cfg)
-	if err != nil {
-		return nil, err
-	}
-	return resolved, nil
-}
-
 // returns and loads either a wrapped model or a model that support audio-to-audio
 func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, evaluator *templates.Evaluator, routing *RealtimeRoutingContext) (Model, error) {
 	xlog.Debug("Creating new model pipeline model", "pipeline", pipeline)

-	cfgVAD, err := loadPipelineSubModel(cl, pipeline.VAD, ml.ModelPath)
+	cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
 	if err != nil {

 		return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -576,7 +557,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
 	}

 	// TODO: Do we always need a transcription model? It can be disabled. Note that any-to-any instruction following models don't transcribe as such, so if transcription is required it is a separate process
-	cfgSST, err := loadPipelineSubModel(cl, pipeline.Transcription, ml.ModelPath)
+	cfgSST, err := cl.LoadModelConfigFileByName(pipeline.Transcription, ml.ModelPath)
 	if err != nil {

 		return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -608,7 +589,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
 	xlog.Debug("Loading a wrapped model")

 	// Otherwise we want to return a wrapped model, which is a "virtual" model that re-uses other models to perform operations
-	cfgLLM, err := loadPipelineSubModel(cl, pipeline.LLM, ml.ModelPath)
+	cfgLLM, err := cl.LoadModelConfigFileByName(pipeline.LLM, ml.ModelPath)
 	if err != nil {

 		return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -623,7 +604,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
 	applyPipelineReasoning(cfgLLM, *pipeline)
 	applyPipelineThinking(cfgLLM, *pipeline)

-	cfgTTS, err := loadPipelineSubModel(cl, pipeline.TTS, ml.ModelPath)
+	cfgTTS, err := cl.LoadModelConfigFileByName(pipeline.TTS, ml.ModelPath)
 	if err != nil {

 		return nil, fmt.Errorf("failed to load backend config: %w", err)
--- a/core/http/endpoints/openai/realtime_model_alias_test.go
+++ b/core/http/endpoints/openai/realtime_model_alias_test.go
@@ -1,52 +0,0 @@
-package openai
-
-import (
-	"os"
-	"path/filepath"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	"github.com/mudler/LocalAI/core/config"
-)
-
-// loadPipelineSubModel must resolve a pipeline sub-model that references an
-// alias (e.g. `llm: default`) one hop to the alias target's full config — so
-// the effective backend is the target's backend, not the empty backend of the
-// alias stub. This mirrors the top-level alias resolution done in
-// core/http/middleware/request.go, which the realtime pipeline previously
-// skipped (failing in distributed mode with "backend name is empty").
-var _ = Describe("loadPipelineSubModel", func() {
-	It("resolves a sub-model alias one hop to the target's config", func() {
-		tmpDir := GinkgoT().TempDir()
-
-		// A real model config with a concrete backend.
-		realLLM := `name: real-llm
-backend: llama-cpp
-parameters:
-  model: real-llm.gguf
-`
-		Expect(os.WriteFile(filepath.Join(tmpDir, "real-llm.yaml"), []byte(realLLM), 0644)).To(Succeed())
-
-		// An alias pointing at the real model.
-		aliasCfg := `name: default
-alias: real-llm
-`
-		Expect(os.WriteFile(filepath.Join(tmpDir, "default.yaml"), []byte(aliasCfg), 0644)).To(Succeed())
-
-		cl := config.NewModelConfigLoader(tmpDir)
-		Expect(cl.LoadModelConfigsFromPath(tmpDir)).To(Succeed())
-
-		// Resolving the alias must follow the hop to the target's full config.
-		resolved, err := loadPipelineSubModel(cl, "default", tmpDir)
-		Expect(err).NotTo(HaveOccurred())
-		Expect(resolved.IsAlias()).To(BeFalse())
-		Expect(resolved.Backend).To(Equal("llama-cpp"))
-
-		// A non-alias name must load unchanged.
-		direct, err := loadPipelineSubModel(cl, "real-llm", tmpDir)
-		Expect(err).NotTo(HaveOccurred())
-		Expect(direct.Backend).To(Equal("llama-cpp"))
-		Expect(direct.Name).To(Equal("real-llm"))
-	})
-})
--- a/core/http/react-ui/public/locales/en/chat.json
+++ b/core/http/react-ui/public/locales/en/chat.json
@@ -86,7 +86,6 @@
  "input": {
    "placeholder": "Message...",
    "attachFile": "Attach file",
-    "send": "Send message",
    "stopGenerating": "Stop generating",
    "canvasTitle": "Canvas — extract code blocks and media into a side panel for preview, copy, and download",
    "canvasLabel": "Canvas",
--- a/core/http/react-ui/public/locales/en/home.json
+++ b/core/http/react-ui/public/locales/en/home.json
@@ -77,21 +77,6 @@
    "noModelsTitle": "No Models Available",
    "noModelsBody": "There are no models installed yet. Ask your administrator to set up models so you can start chatting."
  },
-  "starters": {
-    "title": "Recommended for your hardware",
-    "tier": {
-      "cpu": "CPU-only",
-      "gpu-small": "GPU",
-      "gpu-mid": "GPU",
-      "gpu-large": "GPU"
-    },
-    "cpuNote": "No GPU detected — these small models stay responsive on CPU.",
-    "gpuNote": "Picked to fit your available VRAM with room for context.",
-    "install": "Install",
-    "installing": "Installing",
-    "installStarted": "Installing {{model}}…",
-    "installFailed": "Install failed: {{message}}"
-  },
  "connect": {
    "title": "One endpoint, every API",
    "subtitle": "LocalAI serves its own full API — image & video generation, depth, object detection, reranking, audio, face & voice recognition, and realtime voice over WebRTC and WebSocket. On top of that, a drop-in compatibility layer lets any app built for OpenAI, Anthropic, Ollama or OpenAI Responses talk to it unchanged.",
--- a/core/http/react-ui/public/locales/en/models.json
+++ b/core/http/react-ui/public/locales/en/models.json
@@ -2,16 +2,6 @@
  "title": "Install Models",
  "subtitle": "Browse and install AI models from the gallery",
  "models": "Models",
-  "recommended": {
-    "title": "Recommended for your hardware",
-    "cpuNote": "No GPU detected - small models that stay responsive on CPU.",
-    "gpuNote": "Sized to fit your available VRAM with room for context.",
-    "install": "Install",
-    "installing": "Installing",
-    "installStarted": "Installing {{model}}…",
-    "installFailed": "Install failed: {{message}}",
-    "dismiss": "Dismiss recommendations"
-  },
  "stats": {
    "available": "Available",
    "installed": "Installed"
--- a/core/http/react-ui/src/App.css
+++ b/core/http/react-ui/src/App.css
@@ -6363,130 +6363,6 @@ select.input {
  justify-content: center;
 }

-/* ──────────────────── Home: hardware-aware starter models ──────────────────── */
-
-.home-starters {
-  margin: var(--spacing-lg) 0;
-  padding: var(--spacing-lg);
-}
-.home-starters-head {
-  display: flex;
-  align-items: center;
-  justify-content: space-between;
-  gap: var(--spacing-md);
-}
-.home-starters-head strong {
-  font-size: 0.9375rem;
-}
-.home-starters-tier {
-  display: inline-flex;
-  align-items: center;
-  gap: var(--spacing-xs);
-  font-size: 0.75rem;
-  color: var(--color-text-muted);
-}
-.home-starters-sub {
-  margin: var(--spacing-xs) 0 var(--spacing-md);
-  font-size: 0.8125rem;
-  color: var(--color-text-secondary);
-}
-.home-starters-list {
-  list-style: none;
-  margin: 0;
-  padding: 0;
-  display: flex;
-  flex-direction: column;
-  gap: var(--spacing-xs);
-}
-.home-starters-item {
-  display: flex;
-  align-items: center;
-  gap: var(--spacing-md);
-  padding: var(--spacing-xs) 0;
-}
-.home-starters-name {
-  font-weight: 500;
-  font-size: 0.875rem;
-  word-break: break-all;
-}
-.home-starters-badge {
-  font-size: 0.625rem;
-}
-.home-starters-size {
-  margin-left: auto;
-  font-size: 0.75rem;
-  color: var(--color-text-muted);
-  white-space: nowrap;
-}
-
-/* ──────────────────── Models gallery: recommended-for-your-hardware strip ──────────────────── */
-
-.rec-models {
-  margin-bottom: var(--spacing-md);
-  padding: var(--spacing-md) var(--spacing-lg);
-}
-.rec-models-head {
-  display: flex;
-  align-items: flex-start;
-  justify-content: space-between;
-  gap: var(--spacing-md);
-}
-.rec-models-title {
-  display: flex;
-  align-items: center;
-  gap: var(--spacing-sm);
-  flex-wrap: wrap;
-}
-.rec-models-title i {
-  color: var(--color-primary);
-}
-.rec-models-note {
-  font-size: 0.8125rem;
-  color: var(--color-text-secondary);
-}
-.rec-models-dismiss {
-  background: none;
-  border: none;
-  color: var(--color-text-muted);
-  cursor: pointer;
-  padding: 4px;
-  flex-shrink: 0;
-}
-.rec-models-dismiss:hover {
-  color: var(--color-text-primary);
-}
-.rec-models-grid {
-  display: grid;
-  grid-template-columns: repeat(auto-fill, minmax(220px, 1fr));
-  gap: var(--spacing-sm);
-  margin-top: var(--spacing-md);
-}
-.rec-models-item {
-  display: flex;
-  flex-direction: column;
-  gap: var(--spacing-xs);
-  padding: var(--spacing-sm) var(--spacing-md);
-  border: 1px solid var(--color-border-subtle);
-  border-radius: var(--radius-md);
-  background: var(--color-bg-primary);
-}
-.rec-models-item-name {
-  font-weight: 500;
-  font-size: 0.8125rem;
-  word-break: break-all;
-}
-.rec-models-item-meta {
-  display: flex;
-  gap: var(--spacing-sm);
-  font-size: 0.75rem;
-  color: var(--color-text-muted);
-}
-.rec-models-item-fit {
-  display: inline-flex;
-  align-items: center;
-  gap: 4px;
-}
-
 /* ──────────────────── Home: drop-in endpoint / API compatibility ──────────────────── */

 .home-connect {
--- a/core/http/react-ui/src/components/ModelSelector.jsx
+++ b/core/http/react-ui/src/components/ModelSelector.jsx
@@ -1,25 +1,8 @@
-import { useEffect, useMemo, useCallback } from 'react'
+import { useEffect, useMemo } from 'react'
 import { useModels } from '../hooks/useModels'
 import SearchableSelect from './SearchableSelect'
 import { useTranslation } from 'react-i18next'

-// Remember the last model the user picked, keyed by capability, so returning to
-// a page (Home chat box, Image, TTS, Talk...) defaults to that model instead of
-// whatever happens to sort first. Only persisted when a capability key exists —
-// `externalOptions` callers pass no capability and get the old first-item
-// behaviour. localStorage access is wrapped because private-browsing modes throw.
-const LAST_MODEL_PREFIX = 'localai_last_model:'
-
-function readLastModel(capability) {
-  if (!capability) return null
-  try { return localStorage.getItem(LAST_MODEL_PREFIX + capability) } catch { return null }
-}
-
-function writeLastModel(capability, model) {
-  if (!capability || !model) return
-  try { localStorage.setItem(LAST_MODEL_PREFIX + capability, model) } catch { /* ignore */ }
-}
-
 export default function ModelSelector({
  value, onChange, capability, className = '',
  options: externalOptions, loading: externalLoading,
@@ -36,27 +19,16 @@ export default function ModelSelector({
  const isLoading = externalOptions ? (externalLoading || false) : hookLoading
  const isDisabled = isLoading || (externalDisabled || false)

-  // Persist genuine selections so the next visit can restore them.
-  const handleChange = useCallback((next) => {
-    writeLastModel(capability, next)
-    onChange(next)
-  }, [capability, onChange])
-
  useEffect(() => {
    if (modelNames.length > 0 && (!value || !modelNames.includes(value))) {
-      // Prefer the remembered model when it's still available; otherwise fall
-      // back to the first option. Don't re-persist here — auto-select is not a
-      // user choice, and writing back the stored value would be a harmless but
-      // pointless round-trip.
-      const remembered = readLastModel(capability)
-      onChange(remembered && modelNames.includes(remembered) ? remembered : modelNames[0])
+      onChange(modelNames[0])
    }
-  }, [modelNames, value, onChange, capability])
+  }, [modelNames, value, onChange])

  return (
    <SearchableSelect
      value={value || ''}
-      onChange={handleChange}
+      onChange={onChange}
      options={modelNames}
      placeholder={isLoading ? t('selector.loading') : (modelNames.length === 0 ? t('selector.noModels') : t('selector.selectModel'))}
      searchPlaceholder={searchPlaceholder || t('selector.searchPlaceholder')}
--- a/core/http/react-ui/src/components/RecommendedModels.jsx
+++ b/core/http/react-ui/src/components/RecommendedModels.jsx
@@ -1,86 +0,0 @@
-import { useState } from 'react'
-import { useTranslation } from 'react-i18next'
-import { modelsApi } from '../utils/api'
-import { useRecommendedModels, isNvfp4Name } from '../hooks/useRecommendedModels'
-
-const DISMISS_KEY = 'localai_rec_models_dismissed'
-
-// "Recommended for your hardware" strip at the top of the Models gallery. Shares
-// the hardware-fit ranking with the empty-state starter widget via
-// useRecommendedModels, but styled for the gallery page and dismissible (the
-// gallery is a repeat-visit surface, so it shouldn't nag).
-export default function RecommendedModels({ addToast }) {
-  const { t } = useTranslation('models')
-  const { recommended, tier, loading } = useRecommendedModels({ count: 4 })
-  const [installing, setInstalling] = useState(() => new Set())
-  const [dismissed, setDismissed] = useState(() => {
-    try { return localStorage.getItem(DISMISS_KEY) === '1' } catch { return false }
-  })
-
-  if (loading || dismissed) return null
-  if (!recommended || recommended.length === 0) return null
-
-  const dismiss = () => {
-    try { localStorage.setItem(DISMISS_KEY, '1') } catch { /* ignore */ }
-    setDismissed(true)
-  }
-
-  const install = async (name) => {
-    setInstalling(prev => new Set(prev).add(name))
-    try {
-      await modelsApi.install(name)
-      addToast?.(t('recommended.installStarted', { model: name }), 'success')
-    } catch (err) {
-      addToast?.(t('recommended.installFailed', { message: err.message }), 'error')
-      setInstalling(prev => {
-        const next = new Set(prev)
-        next.delete(name)
-        return next
-      })
-    }
-  }
-
-  const isGpu = tier.id !== 'cpu'
-
-  return (
-    <div className="rec-models card">
-      <div className="rec-models-head">
-        <div className="rec-models-title">
-          <i className={`fas ${isGpu ? 'fa-microchip' : 'fa-memory'}`} aria-hidden="true" />
-          <strong>{t('recommended.title')}</strong>
-          <span className="rec-models-note">{isGpu ? t('recommended.gpuNote') : t('recommended.cpuNote')}</span>
-        </div>
-        <button type="button" className="rec-models-dismiss" onClick={dismiss} aria-label={t('recommended.dismiss')} title={t('recommended.dismiss')}>
-          <i className="fas fa-times" aria-hidden="true" />
-        </button>
-      </div>
-      <div className="rec-models-grid">
-        {recommended.map(m => {
-          const busy = installing.has(m.name)
-          return (
-            <div key={m.name} className="rec-models-item">
-              <div className="rec-models-item-name">{m.name}</div>
-              <div className="rec-models-item-meta">
-                {isNvfp4Name(m.name) && <span className="badge badge-info">NVFP4</span>}
-                {m.sizeDisplay && <span>{m.sizeDisplay}</span>}
-                {isGpu && m.vramDisplay && (
-                  <span className="rec-models-item-fit"><i className="fas fa-microchip" aria-hidden="true" /> {m.vramDisplay}</span>
-                )}
-              </div>
-              <button
-                type="button"
-                className="btn btn-primary btn-sm"
-                disabled={busy}
-                onClick={() => install(m.name)}
-              >
-                {busy
-                  ? (<><i className="fas fa-spinner fa-spin" aria-hidden="true" /> {t('recommended.installing')}</>)
-                  : (<><i className="fas fa-download" aria-hidden="true" /> {t('recommended.install')}</>)}
-              </button>
-            </div>
-          )
-        })}
-      </div>
-    </div>
-  )
-}
--- a/core/http/react-ui/src/components/StarterModels.jsx
+++ b/core/http/react-ui/src/components/StarterModels.jsx
@@ -1,129 +0,0 @@
-import { useState } from 'react'
-import { useTranslation } from 'react-i18next'
-import { modelsApi } from '../utils/api'
-import { useRecommendedModels, isNvfp4Name } from '../hooks/useRecommendedModels'
-
-// Static fallback used only when the live gallery / estimates can't be reached
-// (offline, trimmed gallery). The hook is the primary, data-driven path; these
-// are real gallery names kept as a safety net so onboarding never shows nothing.
-// Gemma picks use the QAT (quantization-aware-trained) Q4 builds. NVIDIA boxes
-// get NVFP4 + MTP variants at the mid/large tiers (see NVIDIA below).
-const BASE = {
-  cpu: [
-    { name: 'gemma-4-e2b-it-qat-q4_0', size: '~1.5 GB' },
-    { name: 'qwen3.5-4b-claude-4.6-opus-reasoning-distilled', size: '~2.5 GB' },
-    { name: 'gemma-4-e4b-it-qat-q4_0', size: '~3 GB' },
-    { name: 'lfm2.5-1.2b-instruct', size: '~0.8 GB' },
-  ],
-  'gpu-small': [
-    { name: 'gemma-4-e4b-it-qat-q4_0', size: '~3 GB' },
-    { name: 'lfm2.5-8b-a1b', size: '~5 GB' },
-    { name: 'qwen3.5-9b', size: '~5.5 GB' },
-    { name: 'gemma-4-12b-it-qat-q4_0', size: '~7 GB' },
-  ],
-  'gpu-mid': [
-    { name: 'qwen3.6-27b', size: '~16 GB' },
-    { name: 'qwen3.6-27b-mtp-pi-tune', size: '~16 GB' },
-    { name: 'gemma-4-26b-a4b-it-qat-q4_0', size: '~16 GB' },
-    { name: 'qwen3.5-27b', size: '~16 GB' },
-  ],
-  'gpu-large': [
-    { name: 'qwen3.6-35b-a3b-apex', size: '~20 GB' },
-    { name: 'qwen3.6-35b-a3b-claude-4.6-opus-reasoning-distilled', size: '~20 GB' },
-    { name: 'gemma-4-31b-it-qat-q4_0', size: '~18 GB' },
-    { name: 'qwen3.5-35b-a3b-apex', size: '~20 GB' },
-  ],
-}
-
-// NVIDIA-only overrides: NVFP4 is a Blackwell-optimised 4-bit format paired with
-// MTP (multi-token prediction) for speed. Only the mid/large tiers have these.
-const NVIDIA = {
-  'gpu-mid': [
-    { name: 'qwen3.6-27b-nvfp4-mtp', size: '~14 GB' },
-    { name: 'qwen3.6-27b-mtp-pi-tune', size: '~16 GB' },
-    { name: 'gemma-4-26b-a4b-it-qat-q4_0', size: '~16 GB' },
-    { name: 'qwen3.6-27b', size: '~16 GB' },
-  ],
-  'gpu-large': [
-    { name: 'qwen3.6-35b-a3b-nvfp4-mtp', size: '~18 GB' },
-    { name: 'qwen3.6-27b-nvfp4-mtp', size: '~14 GB' },
-    { name: 'qwen3.6-35b-a3b-apex', size: '~20 GB' },
-    { name: 'gemma-4-31b-it-qat-q4_0', size: '~18 GB' },
-  ],
-}
-
-function fallbackFor(tierId, isNvidia) {
-  if (isNvidia && NVIDIA[tierId]) return NVIDIA[tierId]
-  return BASE[tierId] || BASE.cpu
-}
-
-export default function StarterModels({ addToast, onInstallStarted }) {
-  const { t } = useTranslation('home')
-  const { recommended, tier, isNvidia, loading } = useRecommendedModels({ count: 4 })
-  const [installing, setInstalling] = useState(() => new Set())
-
-  // While the hardware probe + gallery query are in flight, render nothing
-  // rather than flashing fallback content that may be replaced a moment later.
-  if (loading) return null
-
-  // Prefer live recommendations; fall back to the static list only when the
-  // gallery yielded nothing.
-  const items = (recommended && recommended.length > 0)
-    ? recommended.map(r => ({ name: r.name, size: r.sizeDisplay }))
-    : fallbackFor(tier.id, isNvidia)
-
-  if (items.length === 0) return null
-
-  const install = async (name) => {
-    setInstalling(prev => new Set(prev).add(name))
-    try {
-      await modelsApi.install(name)
-      addToast?.(t('starters.installStarted', { model: name }), 'success')
-      onInstallStarted?.(name)
-    } catch (err) {
-      addToast?.(t('starters.installFailed', { message: err.message }), 'error')
-      setInstalling(prev => {
-        const next = new Set(prev)
-        next.delete(name)
-        return next
-      })
-    }
-  }
-
-  return (
-    <section className="home-starters card">
-      <div className="home-starters-head">
-        <strong>{t('starters.title')}</strong>
-        <span className="home-starters-tier">
-          <i className={`fas ${tier.id === 'cpu' ? 'fa-memory' : 'fa-microchip'}`} aria-hidden="true" />
-          {t(`starters.tier.${tier.id}`)}
-        </span>
-      </div>
-      <p className="home-starters-sub">
-        {tier.id === 'cpu' ? t('starters.cpuNote') : t('starters.gpuNote')}
-      </p>
-      <ul className="home-starters-list">
-        {items.map(c => {
-          const busy = installing.has(c.name)
-          return (
-            <li key={c.name} className="home-starters-item">
-              <span className="home-starters-name">{c.name}</span>
-              {isNvfp4Name(c.name) && <span className="badge badge-info home-starters-badge">NVFP4</span>}
-              {c.size && <span className="home-starters-size">{c.size}</span>}
-              <button
-                type="button"
-                className="btn btn-primary btn-sm"
-                disabled={busy}
-                onClick={() => install(c.name)}
-              >
-                {busy
-                  ? (<><i className="fas fa-spinner fa-spin" aria-hidden="true" /> {t('starters.installing')}</>)
-                  : (<><i className="fas fa-download" aria-hidden="true" /> {t('starters.install')}</>)}
-              </button>
-            </li>
-          )
-        })}
-      </ul>
-    </section>
-  )
-}
--- a/core/http/react-ui/src/hooks/usePolling.js
+++ b/core/http/react-ui/src/hooks/usePolling.js
@@ -1,66 +0,0 @@
-import { useEffect, useRef, useCallback } from 'react'
-
-// usePolling runs `fn` immediately and then on a fixed interval, with two
-// behaviours every hand-rolled setInterval in this app was missing:
-//
-//   1. Visibility-aware: the timer pauses while the tab is hidden
-//      (document.hidden) and fires an immediate catch-up poll when the tab
-//      becomes visible again. A backgrounded dashboard no longer hammers the
-//      server every few seconds for data nobody is looking at.
-//   2. Non-overlapping: if `fn` returns a promise that takes longer than the
-//      interval, the next tick waits for it instead of stacking requests.
-//
-// `enabled: false` stops polling entirely (one-shot or gated polls). The
-// returned `refetch` runs `fn` on demand and is stable across renders.
-export function usePolling(fn, intervalMs = 5000, { enabled = true, immediate = true } = {}) {
-  const fnRef = useRef(fn)
-  fnRef.current = fn
-
-  const runningRef = useRef(false)
-  const refetch = useCallback(async () => {
-    // Guard against overlap: a slow poll shouldn't pile up behind a fast timer.
-    if (runningRef.current) return
-    runningRef.current = true
-    try {
-      return await fnRef.current()
-    } finally {
-      runningRef.current = false
-    }
-  }, [])
-
-  useEffect(() => {
-    if (!enabled) return
-    let timer = null
-
-    const tick = () => { refetch() }
-
-    const start = () => {
-      if (timer != null) return
-      timer = setInterval(tick, intervalMs)
-    }
-    const stop = () => {
-      if (timer != null) { clearInterval(timer); timer = null }
-    }
-
-    const onVisibility = () => {
-      if (document.hidden) {
-        stop()
-      } else {
-        // Catch up immediately on return, then resume the cadence.
-        tick()
-        start()
-      }
-    }
-
-    if (immediate) tick()
-    if (!document.hidden) start()
-    document.addEventListener('visibilitychange', onVisibility)
-
-    return () => {
-      stop()
-      document.removeEventListener('visibilitychange', onVisibility)
-    }
-  }, [enabled, intervalMs, immediate, refetch])
-
-  return { refetch }
-}
--- a/core/http/react-ui/src/hooks/useRecommendedModels.js
+++ b/core/http/react-ui/src/hooks/useRecommendedModels.js
@@ -1,108 +0,0 @@
-import { useState, useEffect } from 'react'
-import { modelsApi } from '../utils/api'
-import { useResources } from './useResources'
-
-// Data-driven "recommended for your hardware" model picks. The gallery exposes
-// no popularity/download signal and the list response carries no size, so we:
-//   1. ask the server for chat-capable models in their natural (curated) order,
-//   2. estimate size/VRAM for the top candidates (same endpoint the Models page
-//      uses), and
-//   3. rank by hardware fit — smallest on CPU-only boxes, largest-that-fits on
-//      GPUs (bigger == better quality while still fitting VRAM).
-//
-// Returns `recommended === null` while loading, `[]` when nothing could be
-// resolved (gallery/estimates unavailable) so callers can fall back.
-
-const GB = 1024 * 1024 * 1024
-const DEFAULT_CTX = 4096
-
-// NVFP4 is a Blackwell/NVIDIA-specific 4-bit format — only worth suggesting on
-// NVIDIA hardware, and to be filtered out elsewhere.
-export const isNvfp4Name = (name) => /nvfp4/i.test(name || '')
-
-export function hasNvidiaGpu(resources) {
-  return Array.isArray(resources?.gpus) &&
-    resources.gpus.some(g => (g?.vendor || '').toLowerCase() === 'nvidia')
-}
-
-export function recommendTier(resources) {
-  const isGpu = resources?.type === 'gpu'
-  const vram = resources?.aggregate?.total_memory || 0
-  if (!isGpu || vram <= 0) return { id: 'cpu', vram: 0 }
-  if (vram < 8 * GB) return { id: 'gpu-small', vram }
-  if (vram < 24 * GB) return { id: 'gpu-mid', vram }
-  return { id: 'gpu-large', vram }
-}
-
-function rank(candidates, tier, count, isNvidia) {
-  // NVFP4 only runs on NVIDIA (Blackwell) — drop it everywhere else, and prefer
-  // it on NVIDIA boxes where it's the fastest path.
-  const pool = candidates.filter(c => c.sizeBytes != null && (isNvidia || !isNvfp4Name(c.name)))
-  if (tier.id === 'cpu') {
-    // No GPU: smallest models stay responsive on CPU.
-    return [...pool].sort((a, b) => a.sizeBytes - b.sizeBytes).slice(0, count)
-  }
-  const limit = tier.vram * 0.95
-  const fits = pool.filter(c => c.vramBytes != null && c.vramBytes <= limit)
-  const base = fits.length > 0 ? fits : pool // tiny GPU where nothing fits → fall through to smallest
-  const byPreference = (a, b) => {
-    // On NVIDIA, surface NVFP4 first; then largest-that-fits (best quality).
-    if (isNvidia) {
-      const an = isNvfp4Name(a.name), bn = isNvfp4Name(b.name)
-      if (an !== bn) return an ? -1 : 1
-    }
-    return fits.length > 0 ? b.sizeBytes - a.sizeBytes : a.sizeBytes - b.sizeBytes
-  }
-  return [...base].sort(byPreference).slice(0, count)
-}
-
-export function useRecommendedModels({ count = 4, candidatePool = 10 } = {}) {
-  const { resources } = useResources()
-  const [recommended, setRecommended] = useState(null)
-  const [error, setError] = useState(null)
-
-  const resReady = resources !== null
-  const tier = recommendTier(resources)
-  const isNvidia = hasNvidiaGpu(resources)
-
-  useEffect(() => {
-    if (!resReady) return
-    let cancelled = false
-    setRecommended(null)
-    setError(null)
-    ;(async () => {
-      try {
-        const data = await modelsApi.list({ tag: 'chat', items: candidatePool, page: 1 })
-        // Recommend models the user hasn't installed yet.
-        const models = (data?.models || []).filter(m => !m.installed)
-        const estimated = await Promise.all(models.map(async (m) => {
-          const name = m.name || m.id
-          try {
-            const e = await modelsApi.estimate(name, [DEFAULT_CTX])
-            const ctx = e?.estimates?.[String(DEFAULT_CTX)]
-            return {
-              name,
-              description: m.description,
-              sizeBytes: e?.sizeBytes ?? null,
-              sizeDisplay: e?.sizeDisplay ?? null,
-              vramBytes: ctx?.vramBytes ?? null,
-              vramDisplay: ctx?.vramDisplay ?? null,
-            }
-          } catch {
-            return { name, sizeBytes: null }
-          }
-        }))
-        if (cancelled) return
-        setRecommended(rank(estimated, tier, count, isNvidia))
-      } catch (e) {
-        if (cancelled) return
-        setError(e.message)
-        setRecommended([])
-      }
-    })()
-    return () => { cancelled = true }
-    // tier.id / tier.vram / isNvidia are primitives, so resource polling doesn't re-run this.
-  }, [resReady, tier.id, tier.vram, isNvidia, count, candidatePool])
-
-  return { recommended, tier, isNvidia, error, loading: recommended === null }
-}
--- a/core/http/react-ui/src/hooks/useResources.js
+++ b/core/http/react-ui/src/hooks/useResources.js
@@ -1,11 +1,11 @@
-import { useState, useCallback } from 'react'
+import { useState, useEffect, useCallback, useRef } from 'react'
 import { resourcesApi } from '../utils/api'
-import { usePolling } from './usePolling'

 export function useResources(pollInterval = 5000) {
  const [resources, setResources] = useState(null)
  const [loading, setLoading] = useState(true)
  const [error, setError] = useState(null)
+  const intervalRef = useRef(null)

  const fetchResources = useCallback(async () => {
    try {
@@ -19,10 +19,13 @@ export function useResources(pollInterval = 5000) {
    }
  }, [])

-  // Visibility-aware polling: pauses while the tab is hidden and catches up on
-  // return (see usePolling). Resource stats are pure dashboard data, so there's
-  // no reason to keep fetching them for a backgrounded tab.
-  const { refetch } = usePolling(fetchResources, pollInterval)
+  useEffect(() => {
+    fetchResources()
+    intervalRef.current = setInterval(fetchResources, pollInterval)
+    return () => {
+      if (intervalRef.current) clearInterval(intervalRef.current)
+    }
+  }, [fetchResources, pollInterval])

-  return { resources, loading, error, refetch }
+  return { resources, loading, error, refetch: fetchResources }
 }
--- a/core/http/react-ui/src/pages/AgentChat.jsx
+++ b/core/http/react-ui/src/pages/AgentChat.jsx
@@ -765,10 +765,8 @@ export default function AgentChat() {
            className="chat-send-btn"
            onClick={handleSend}
            disabled={processing || !input.trim()}
-            aria-label="Send message"
-            title="Send message"
          >
-            <i className="fas fa-paper-plane" aria-hidden="true" />
+            <i className="fas fa-paper-plane" />
          </button>
        </div>
      </div>
--- a/core/http/react-ui/src/pages/Chat.jsx
+++ b/core/http/react-ui/src/pages/Chat.jsx
@@ -1427,10 +1427,8 @@ export default function Chat() {
                className="chat-send-btn"
                onClick={handleSend}
                disabled={!input.trim() && files.length === 0}
-                aria-label={t('input.send')}
-                title={t('input.send')}
              >
-                <i className="fas fa-paper-plane" aria-hidden="true" />
+                <i className="fas fa-paper-plane" />
              </button>
            )}
          </div>
--- a/core/http/react-ui/src/pages/Home.jsx
+++ b/core/http/react-ui/src/pages/Home.jsx
@@ -10,7 +10,6 @@ import UnifiedMCPDropdown from '../components/UnifiedMCPDropdown'
 import ConfirmDialog from '../components/ConfirmDialog'
 import HomeConnect from '../components/HomeConnect'
 import { useResources } from '../hooks/useResources'
-import { usePolling } from '../hooks/usePolling'
 import { fileToBase64, backendControlApi, systemApi, modelsApi, mcpApi, nodesApi } from '../utils/api'
 import { API_CONFIG } from '../utils/config'
 import { greetingKey } from '../utils/greeting'
@@ -18,7 +17,6 @@ import StatusPill from '../components/StatusPill'
 import Skeleton from '../components/Skeleton'
 import SectionHeading from '../components/SectionHeading'
 import EmptyState from '../components/EmptyState'
-import StarterModels from '../components/StarterModels'
 import { staggerStyle } from '../hooks/useStagger'

 export default function Home() {
@@ -70,36 +68,40 @@ export default function Home() {
      .catch(() => {})
  }, [])

-  // Poll cluster node data in distributed mode. Visibility-aware + gated on
-  // distributedMode so a non-distributed or backgrounded tab makes no calls.
-  const fetchCluster = useCallback(async () => {
-    try {
-      const data = await nodesApi.list()
-      const nodes = Array.isArray(data) ? data : []
-      const backendNodes = nodes.filter(n => !n.node_type || n.node_type === 'backend')
-      const totalVRAM = backendNodes.reduce((sum, n) => sum + (n.total_vram || 0), 0)
-      const usedVRAM = backendNodes.reduce((sum, n) => {
-        if (n.total_vram && n.available_vram != null) return sum + (n.total_vram - n.available_vram)
-        return sum
-      }, 0)
-      const totalRAM = backendNodes.reduce((sum, n) => sum + (n.total_ram || 0), 0)
-      const usedRAM = backendNodes.reduce((sum, n) => {
-        if (n.total_ram && n.available_ram != null) return sum + (n.total_ram - n.available_ram)
-        return sum
-      }, 0)
-      const isGPU = totalVRAM > 0
-      const healthyCount = backendNodes.filter(n => n.status === 'healthy').length
-      const totalCount = backendNodes.length
-      setClusterData({
-        totalMem: isGPU ? totalVRAM : totalRAM,
-        usedMem: isGPU ? usedVRAM : usedRAM,
-        isGPU,
-        healthyCount,
-        totalCount,
-      })
-    } catch { setClusterData(null) }
-  }, [])
-  usePolling(fetchCluster, 5000, { enabled: distributedMode })
+  // Poll cluster node data in distributed mode
+  useEffect(() => {
+    if (!distributedMode) return
+    const fetchCluster = async () => {
+      try {
+        const data = await nodesApi.list()
+        const nodes = Array.isArray(data) ? data : []
+        const backendNodes = nodes.filter(n => !n.node_type || n.node_type === 'backend')
+        const totalVRAM = backendNodes.reduce((sum, n) => sum + (n.total_vram || 0), 0)
+        const usedVRAM = backendNodes.reduce((sum, n) => {
+          if (n.total_vram && n.available_vram != null) return sum + (n.total_vram - n.available_vram)
+          return sum
+        }, 0)
+        const totalRAM = backendNodes.reduce((sum, n) => sum + (n.total_ram || 0), 0)
+        const usedRAM = backendNodes.reduce((sum, n) => {
+          if (n.total_ram && n.available_ram != null) return sum + (n.total_ram - n.available_ram)
+          return sum
+        }, 0)
+        const isGPU = totalVRAM > 0
+        const healthyCount = backendNodes.filter(n => n.status === 'healthy').length
+        const totalCount = backendNodes.length
+        setClusterData({
+          totalMem: isGPU ? totalVRAM : totalRAM,
+          usedMem: isGPU ? usedVRAM : usedRAM,
+          isGPU,
+          healthyCount,
+          totalCount,
+        })
+      } catch { setClusterData(null) }
+    }
+    fetchCluster()
+    const interval = setInterval(fetchCluster, 5000)
+    return () => clearInterval(interval)
+  }, [distributedMode])

  // Fetch configured models (to know if any exist) and loaded models (currently running)
  const fetchSystemInfo = useCallback(async () => {
@@ -121,7 +123,11 @@ export default function Home() {
    }
  }, [])

-  usePolling(fetchSystemInfo, 5000)
+  useEffect(() => {
+    fetchSystemInfo()
+    const interval = setInterval(fetchSystemInfo, 5000)
+    return () => clearInterval(interval)
+  }, [fetchSystemInfo])

  // Check MCP availability when selected model changes
  useEffect(() => {
@@ -517,8 +523,6 @@ export default function Home() {
            </div>
          </div>

-          <StarterModels addToast={addToast} onInstallStarted={fetchSystemInfo} />
-
          <div className="home-wizard-actions">
            <button className="btn btn-primary" onClick={() => navigate('/app/models')}>
              <i className="fas fa-store" /> {t('wizard.browseGallery')}
--- a/core/http/react-ui/src/pages/Models.jsx
+++ b/core/http/react-ui/src/pages/Models.jsx
@@ -13,7 +13,6 @@ import ConfirmDialog from '../components/ConfirmDialog'
 import GalleryLoader from '../components/GalleryLoader'
 import Toggle from '../components/Toggle'
 import ResponsiveTable from '../components/ResponsiveTable'
-import RecommendedModels from '../components/RecommendedModels'
 import React from 'react'


@@ -302,8 +301,6 @@ export default function Models() {
        }
      />

-      <RecommendedModels addToast={addToast} />
-
      {/* Search */}
      <div className="search-bar" style={{ marginBottom: 'var(--spacing-md)' }}>
        <i className="fas fa-search search-icon" />
--- a/core/http/react-ui/src/pages/Usage.jsx
+++ b/core/http/react-ui/src/pages/Usage.jsx
@@ -24,37 +24,7 @@ function formatNumber(n) {
  return String(n)
 }

-// Opt-in token pricing. LocalAI is self-hosted and has no inherent monetary
-// cost, but multi-user deployments use estimated cost for chargeback/budgeting.
-// Prices are admin-supplied $ per 1M tokens, stored locally (per-browser), and
-// the whole cost surface stays hidden until a non-zero price is set.
-const TOKEN_PRICING_KEY = 'localai_token_pricing'
-
-function loadPricing() {
-  try {
-    const p = JSON.parse(localStorage.getItem(TOKEN_PRICING_KEY) || '{}')
-    return { prompt: Number(p.prompt) || 0, completion: Number(p.completion) || 0 }
-  } catch { return { prompt: 0, completion: 0 } }
-}
-
-function savePricing(p) {
-  try { localStorage.setItem(TOKEN_PRICING_KEY, JSON.stringify(p)) } catch { /* ignore */ }
-}
-
-function pricingEnabled(p) { return (p?.prompt || 0) > 0 || (p?.completion || 0) > 0 }
-
-function costOf(row, p) {
-  return (row.prompt_tokens / 1_000_000) * (p.prompt || 0)
-       + (row.completion_tokens / 1_000_000) * (p.completion || 0)
-}
-
-function formatCost(n) {
-  if (!n) return '$0.00'
-  if (n < 0.01) return '<$0.01'
-  return '$' + n.toFixed(2)
-}
-
-function StatCard({ icon, label, value, muted, text }) {
+function StatCard({ icon, label, value, muted }) {
  return (
    <div className="card" style={{ padding: 'var(--spacing-sm) var(--spacing-md)', flex: '1 1 0', minWidth: 120, opacity: muted ? 0.7 : 1 }}>
      <div style={{ display: 'flex', alignItems: 'center', gap: 6, marginBottom: 2 }}>
@@ -62,7 +32,7 @@ function StatCard({ icon, label, value, muted, text }) {
        <span style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', fontWeight: 500, textTransform: 'uppercase', letterSpacing: '0.03em' }}>{label}</span>
      </div>
      <div style={{ fontSize: '1.375rem', fontWeight: 700, fontFamily: 'var(--font-mono)', color: muted ? 'var(--color-text-secondary)' : 'var(--color-text-primary)' }}>
-        {text != null ? text : `${muted ? '~' : ''}${formatNumber(value)}`}
+        {muted ? '~' : ''}{formatNumber(value)}
      </div>
    </div>
  )
@@ -672,10 +642,6 @@ export default function Usage() {
  const [activeTab, setActiveTab] = useState('models')
  const [quotas, setQuotas] = useState([])
  const [selectedUserId, setSelectedUserId] = useState(null)
-  const [pricing, setPricingState] = useState(loadPricing)
-  const [showPricing, setShowPricing] = useState(false)
-  const setPricing = (p) => { setPricingState(p); savePricing(p) }
-  const costEnabled = pricingEnabled(pricing)

  const fetchUsage = useCallback(async () => {
    setLoading(true)
@@ -777,50 +743,11 @@ export default function Usage() {
          <i className="fas fa-key" style={{ fontSize: '0.7rem' }} /> {t('usage.sources.tab')}
        </button>
        <div style={{ flex: 1 }} />
-        <button
-          className={`btn btn-sm ${costEnabled ? 'btn-primary' : 'btn-secondary'}`}
-          onClick={() => setShowPricing(v => !v)}
-          style={{ gap: 4 }}
-          title="Set token pricing to estimate cost"
-        >
-          <i className="fas fa-dollar-sign" /> {costEnabled ? 'Pricing' : 'Set pricing'}
-        </button>
        <button className="btn btn-secondary btn-sm" onClick={fetchUsage} disabled={loading} style={{ gap: 4 }}>
          <i className={`fas fa-rotate${loading ? ' fa-spin' : ''}`} /> Refresh
        </button>
      </div>

-      {showPricing && (
-        <div className="card" style={{ display: 'flex', alignItems: 'flex-end', gap: 'var(--spacing-md)', flexWrap: 'wrap', padding: 'var(--spacing-md)', marginBottom: 'var(--spacing-md)' }}>
-          <div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
-            <label style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', textTransform: 'uppercase', letterSpacing: '0.03em' }}>Prompt $/1M tokens</label>
-            <input
-              className="input" type="number" min="0" step="0.01" style={{ width: 140 }}
-              value={pricing.prompt || ''}
-              placeholder="0.00"
-              onChange={e => setPricing({ ...pricing, prompt: Number(e.target.value) || 0 })}
-            />
-          </div>
-          <div style={{ display: 'flex', flexDirection: 'column', gap: 2 }}>
-            <label style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)', textTransform: 'uppercase', letterSpacing: '0.03em' }}>Completion $/1M tokens</label>
-            <input
-              className="input" type="number" min="0" step="0.01" style={{ width: 140 }}
-              value={pricing.completion || ''}
-              placeholder="0.00"
-              onChange={e => setPricing({ ...pricing, completion: Number(e.target.value) || 0 })}
-            />
-          </div>
-          {costEnabled && (
-            <button className="btn btn-secondary btn-sm" onClick={() => setPricing({ prompt: 0, completion: 0 })} style={{ gap: 4 }}>
-              <i className="fas fa-times" /> Clear
-            </button>
-          )}
-          <span style={{ fontSize: '0.75rem', color: 'var(--color-text-muted)', flex: '1 1 200px' }}>
-            Estimated cost only. Prices are stored in this browser and applied to recorded token counts.
-          </span>
-        </div>
-      )}
-
      {loading ? (
        <div style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
          <LoadingSpinner size="lg" />
@@ -833,9 +760,6 @@ export default function Usage() {
            <StatCard icon="fas fa-arrow-up" label="Prompt" value={displayTotals.prompt_tokens} />
            <StatCard icon="fas fa-arrow-down" label="Completion" value={displayTotals.completion_tokens} />
            <StatCard icon="fas fa-coins" label="Total" value={displayTotals.total_tokens} />
-            {costEnabled && (
-              <StatCard icon="fas fa-dollar-sign" label="Est. Cost" text={formatCost(costOf(displayTotals, pricing))} />
-            )}
          </div>

          {/* Predictions */}
@@ -865,7 +789,6 @@ export default function Usage() {
                      <th style={{ width: 110 }}>Prompt</th>
                      <th style={{ width: 110 }}>Completion</th>
                      <th style={{ width: 110 }}>Total</th>
-                      {costEnabled && <th style={{ width: 100 }}>Est. Cost</th>}
                      <th style={{ width: 140 }}></th>
                    </tr>
                  </thead>
@@ -877,7 +800,6 @@ export default function Usage() {
                        <td style={monoCell}>{formatNumber(row.prompt_tokens)}</td>
                        <td style={monoCell}>{formatNumber(row.completion_tokens)}</td>
                        <td style={{ ...monoCell, fontWeight: 600 }}>{formatNumber(row.total_tokens)}</td>
-                        {costEnabled && <td style={monoCell}>{formatCost(costOf(row, pricing))}</td>}
                        <td><UsageBar value={row.total_tokens} max={maxTokens} /></td>
                      </tr>
                    ))}
@@ -905,7 +827,6 @@ export default function Usage() {
                      <th style={{ width: 110 }}>Prompt</th>
                      <th style={{ width: 110 }}>Completion</th>
                      <th style={{ width: 110 }}>Total</th>
-                      {costEnabled && <th style={{ width: 100 }}>Est. Cost</th>}
                      <th style={{ width: 110 }}>Proj. Total</th>
                      <th style={{ width: 140 }}></th>
                    </tr>
@@ -928,7 +849,6 @@ export default function Usage() {
                            <td style={monoCell}>{formatNumber(row.prompt_tokens)}</td>
                            <td style={monoCell}>{formatNumber(row.completion_tokens)}</td>
                            <td style={{ ...monoCell, fontWeight: 600 }}>{formatNumber(row.total_tokens)}</td>
-                            {costEnabled && <td style={monoCell}>{formatCost(costOf(row, pricing))}</td>}
                            <td style={{ ...monoCell, color: 'var(--color-text-muted)', fontStyle: 'italic' }}>
                              {up?.predictions ? `~${formatNumber(up.predictions.projectedTotals.total_tokens)}` : '-'}
                            </td>
@@ -936,7 +856,7 @@ export default function Usage() {
                          </tr>
                          {isExpanded && up && (
                            <tr>
-                              <td colSpan={costEnabled ? 9 : 8} style={{ padding: 0, background: 'var(--color-bg-secondary)' }}>
+                              <td colSpan={8} style={{ padding: 0, background: 'var(--color-bg-secondary)' }}>
                                <div style={{ padding: 'var(--spacing-md)' }}>
                                  {up.predictions && (
                                    <div style={{ display: 'grid', gridTemplateColumns: 'repeat(auto-fit, minmax(100px, 1fr))', gap: 'var(--spacing-xs)', marginBottom: 'var(--spacing-sm)' }}>
--- a/core/services/nodes/router.go
+++ b/core/services/nodes/router.go
@@ -156,10 +156,7 @@ func applyNodeHardwareDefaults(opts *pb.ModelOptions, node *BackendNode) {
 		VRAM:              node.TotalVRAM,
 	}
 	if config.IsManagedPhysicalBatch(int(opts.NBatch)) {
-		// Gate the raised batch on the selected node's per-device VRAM at this
-		// model's context, so a large context can't overflow the node's compute
-		// buffer (issue #10485). node.TotalVRAM is the node's reported ceiling.
-		opts.NBatch = int32(config.PhysicalBatchForContext(gpu, int(opts.ContextSize)))
+		opts.NBatch = int32(config.PhysicalBatch(gpu))
 	}
 	// Default concurrent serving for the selected node (the frontend that built
 	// the options may have no GPU). Only adds when no parallel option is set.
--- a/core/services/nodes/router_hardware_internal_test.go
+++ b/core/services/nodes/router_hardware_internal_test.go
@@ -8,19 +8,12 @@ import (
 )

 var _ = Describe("applyNodeHardwareDefaults", func() {
-	It("raises a managed default batch on a Blackwell node with headroom", func() {
-		opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 8192}
-		applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1", TotalVRAM: 119 << 30})
+	It("raises a managed default batch on a Blackwell node", func() {
+		opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch}
+		applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1"})
 		Expect(opts.NBatch).To(BeEquivalentTo(config.BlackwellPhysicalBatch))
 	})

-	It("keeps the default batch when a large context would overflow the node", func() {
-		// Regression guard for issue #10485 on the distributed path.
-		opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 204800}
-		applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.0", TotalVRAM: 16 << 30})
-		Expect(opts.NBatch).To(BeEquivalentTo(config.DefaultPhysicalBatch))
-	})
-
 	It("resets a Blackwell guess on a non-Blackwell node", func() {
 		// frontend (Blackwell) guessed high, but the selected node is not Blackwell
 		opts := &pb.ModelOptions{NBatch: config.BlackwellPhysicalBatch}
--- a/docs/content/advanced/model-configuration.md
+++ b/docs/content/advanced/model-configuration.md
@@ -494,6 +494,39 @@ These llama.cpp options are passed through the `options:` array.
 | `direct_io` / `use_direct_io` | bool | `false` | Open the model with `O_DIRECT` (faster cold loads on NVMe; ignored if not supported). |
 | `verbosity` | int | `3` | llama.cpp internal log verbosity threshold. Higher = more verbose. |
 | `override_tensor` / `tensor_buft_overrides` | string | "" | Per-tensor buffer-type overrides for the main model. Format: `<tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...`. Mirrors the existing `draft_override_tensor` syntax for the draft model. |
+| `cpu_moe` | bool | `false` | Keep all MoE expert weights of the main model on CPU (upstream `--cpu-moe`). Frees VRAM on large MoE models (DeepSeek, Qwen3 `*-A3B`). |
+| `n_cpu_moe` | int | `0` | Keep MoE expert weights of the first N main-model layers on CPU (upstream `--n-cpu-moe`). |
+
+#### Generic option passthrough
+
+Any `options:` entry whose name starts with `-` is forwarded **verbatim** to
+upstream llama.cpp's own `llama-server` argument parser. This means any flag the
+bundled llama.cpp supports works without LocalAI needing a dedicated option,
+even ones added after your LocalAI version was built. See the upstream
+[server flags reference](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
+
+The format mirrors the rest of the array: `--flag` for a boolean, or
+`--flag:value` for a flag that takes a value. Only the first `:` separates the
+flag from its value, so embedded colons (e.g. `host:port`) are preserved:
+
+```yaml
+options:
+  - "--cpu-moe"                 # boolean flag
+  - "--n-cpu-moe:4"             # flag with a value
+  - "--override-tensor:exps=CPU"
+```
+
+Notes:
+
+- **Precedence:** passthrough flags are applied last, so an explicit flag
+  overrides the LocalAI option it maps to (e.g. `--ctx-size:8192` overrides
+  `context_size`).
+- **Power-user territory:** an invalid flag or value is rejected by the upstream
+  parser exactly as it would be by `llama-server`, which can fail model loading.
+  Prefer the named options above when one exists.
+- Flags that would terminate the process (such as `--help`, `--usage`,
+  `--version`, `--license`, `--list-devices`, `--cache-list`, and
+  `--completion*`) are ignored.

 ### Prompt Caching

--- a/docs/data/version.json
+++ b/docs/data/version.json
@@ -1,3 +1,3 @@
 {
-  "version": "v4.5.0"
+  "version": "v4.4.3"
 }
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -3,7 +3,24 @@
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
    - https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF
-  description: "Try LFM • Docs • LEAP • Discord\n\n# LFM2.5-1.2B-Instruct\n\nLFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.\n\n  - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.\n  - **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.\n  - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.\n\nFind more information about LFM2.5 in our blog post.\n\n## \U0001F5D2️ Model Details\n\nLFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:\n\n...\n"
+  description: |
+    Try LFM • Docs • LEAP • Discord
+
+    # LFM2.5-1.2B-Instruct
+
+    LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.
+
+      - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.
+      - **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.
+      - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.
+
+    Find more information about LFM2.5 in our blog post.
+
+    ## 🗒️ Model Details
+
+    LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features:
+
+    ...
  license: "other"
  tags:
    - llm
@@ -825,8 +842,8 @@
      use_tokenizer_template: true
  files:
    - filename: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-GGUF/Qwopus3.6-27B-Coder-MTP-Q4_K_M.gguf
+      sha256: b2898667ed7b2388f0ab7691393833ae777f247492bbe62fdb4b2bd3e3cf3f79
      uri: https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF/resolve/main/Qwopus3.6-27B-Coder-MTP-Q4_K_M.gguf
-      sha256: b2b9180093496da2e00439e3fa23227c591355901bfa579bc6897bbc01b755ef
    - filename: llama-cpp/mmproj/Qwopus3.6-27B-Coder-MTP-GGUF/mmproj-F32.gguf
      sha256: 32f7ea0600c07272547da401d460f8abbd980f3a57b69d6df87be0e2505e0b9c
      uri: https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF/resolve/main/mmproj-F32.gguf
--- a/pkg/xsysinfo/gpu.go
+++ b/pkg/xsysinfo/gpu.go
@@ -129,61 +129,6 @@ func TotalAvailableVRAM() (uint64, error) {
 	return 0, nil
 }

-// MinPerGPUVRAM returns the total VRAM of the SMALLEST GPU on the host (in
-// bytes), or 0 when no per-device VRAM is known. Unlike TotalAvailableVRAM
-// (which sums across devices) this reports a single device's ceiling, which is
-// the right figure for decisions about what must fit on one card: the compute
-// buffer (sized by n_ubatch) and the parallel-slot tier. Summing a multi-GPU
-// host's VRAM over-provisions those into a per-device OOM (issue #10485).
-//
-// Unified-memory devices (GB10, Apple) report system RAM as their single
-// device's VRAM, so they are unaffected.
-func MinPerGPUVRAM() (uint64, error) {
-	// Prefer per-device binary detection (nvidia-smi/rocm-smi report true
-	// per-card VRAM); ghw's per-card memory can reflect NUMA node RAM on some
-	// hosts, which is why TotalAvailableVRAM treats it as a sum.
-	if infos := GetGPUMemoryUsage(); len(infos) > 0 {
-		if v := minNonZeroVRAM(infos); v > 0 {
-			return v, nil
-		}
-	}
-
-	// Fallback: ghw per-card memory, taking the minimum non-zero card.
-	if gpus, err := GPUs(); err == nil {
-		var min uint64
-		for _, gpu := range gpus {
-			if gpu == nil || gpu.Node == nil || gpu.Node.Memory == nil {
-				continue
-			}
-			if b := gpu.Node.Memory.TotalUsableBytes; b > 0 {
-				if u := uint64(b); min == 0 || u < min {
-					min = u
-				}
-			}
-		}
-		if min > 0 {
-			return min, nil
-		}
-	}
-
-	return 0, nil
-}
-
-// minNonZeroVRAM returns the smallest non-zero TotalVRAM across the given GPUs,
-// or 0 when none report VRAM.
-func minNonZeroVRAM(infos []GPUMemoryInfo) uint64 {
-	var min uint64
-	for _, g := range infos {
-		if g.TotalVRAM == 0 {
-			continue
-		}
-		if min == 0 || g.TotalVRAM < min {
-			min = g.TotalVRAM
-		}
-	}
-	return min
-}
-
 func HasGPU(vendor string) bool {
 	gpus, err := GPUs()
 	if err != nil {
--- a/pkg/xsysinfo/minvram_internal_test.go
+++ b/pkg/xsysinfo/minvram_internal_test.go
@@ -1,37 +0,0 @@
-package xsysinfo
-
-import (
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("minNonZeroVRAM", func() {
-	const gib = uint64(1) << 30
-
-	It("returns the smallest device on a multi-GPU host", func() {
-		// Two unequal cards (e.g. RTX 5070 Ti + 5060 Ti, both 16 GiB, or a
-		// mixed pair): the smallest device is the per-card allocation ceiling.
-		infos := []GPUMemoryInfo{
-			{TotalVRAM: 16 * gib},
-			{TotalVRAM: 12 * gib},
-		}
-		Expect(minNonZeroVRAM(infos)).To(Equal(12 * gib))
-	})
-
-	It("ignores devices that report zero VRAM", func() {
-		infos := []GPUMemoryInfo{
-			{TotalVRAM: 0},
-			{TotalVRAM: 24 * gib},
-		}
-		Expect(minNonZeroVRAM(infos)).To(Equal(24 * gib))
-	})
-
-	It("returns the single device's VRAM on a one-GPU host", func() {
-		Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 16 * gib}})).To(Equal(16 * gib))
-	})
-
-	It("returns 0 when no device reports VRAM", func() {
-		Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 0}})).To(BeZero())
-		Expect(minNonZeroVRAM(nil)).To(BeZero())
-	})
-})
Author	SHA1	Message	Date
Ettore Di Giacinto	28beac9a18	fix(llama-cpp): terminate tensor/kv override vectors after passthrough The tensor_buft_overrides padding and the kv/draft override terminators ran before the generic option passthrough, so a passthrough flag (--cpu-moe, --override-tensor, --override-kv, ...) appended a real entry after the null sentinel - tripping the model loader's back().pattern == nullptr assertion (crash) or being silently dropped. Move all three termination/padding blocks to the end of params_parse, after both the named-option loop and common_params_parse have pushed their real entries. Also widen the exit()-flag skip list so --version, --license, --list-devices and --cache-list cannot terminate the backend. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 17:28:22 +00:00
Ettore Di Giacinto	977ccd88f0	docs(llama-cpp): document cpu_moe/n_cpu_moe and option passthrough Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 17:13:13 +00:00
Ettore Di Giacinto	74e6c60045	feat(llama-cpp): forward unknown '-' options to upstream arg parser Any options: entry starting with '-' is collected and passed verbatim to llama.cpp's own common_params_parse (LLAMA_EXAMPLE_SERVER) at the end of params_parse, so every upstream llama-server flag works without a new hand-wired branch. Passthrough runs last and wins on overlap; n_parallel is snapshotted to survive parser_init's SERVER reset, and help/usage/completion flags are skipped to avoid exiting the backend. Closes #10483 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 17:10:18 +00:00
Ettore Di Giacinto	811f0db2e3	feat(llama-cpp): add main-model cpu_moe/n_cpu_moe options Mirror the existing draft_cpu_moe/draft_n_cpu_moe siblings for the main model, matching upstream --cpu-moe / --n-cpu-moe (common/arg.cpp). Lets users keep MoE expert weights on CPU to manage VRAM on large MoE models. Closes part of #10483 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 17:07:14 +00:00