test(http): cover parseForwarded edge cases; clarify base-url flag group

Adds direct unit coverage for quoted, malformed, and multi-element Forwarded headers, and moves the external base URL flag out of the auth-only flag group.

Refs #10482
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
docs: document LOCALAI_BASE_URL and reverse-proxy headers
2026-06-25 00:59:28 -04:00 · 2026-06-24 22:14:25 +00:00 · 2026-06-24 22:08:58 +00:00 · 2026-06-24 22:03:56 +00:00 · 2026-06-24 21:58:21 +00:00 · 2026-06-24 21:57:43 +00:00
22 changed files with 267 additions and 470 deletions
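
For readers unfamiliar with the Forwarded header (RFC 7239), the edge cases named in the commit message (quoted values, malformed pairs, multiple comma-separated elements) can be illustrated with a minimal sketch. This is not the parseForwarded implementation from this change, whose body is not shown here. The names sketchParseForwarded, splitUnquoted, and unquote are hypothetical, and the sketch assumes that only the first element is consulted and that pairs without "=" are skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// splitUnquoted splits s on sep, ignoring separators that appear inside
// double-quoted strings. Inside quotes, a backslash escapes the next byte.
// Hypothetical helper for illustration only.
func splitUnquoted(s string, sep byte) []string {
	var parts []string
	inQuotes := false
	start := 0
	for i := 0; i < len(s); i++ {
		switch {
		case inQuotes && s[i] == '\\':
			i++ // skip the escaped byte
		case s[i] == '"':
			inQuotes = !inQuotes
		case s[i] == sep && !inQuotes:
			parts = append(parts, s[start:i])
			start = i + 1
		}
	}
	return append(parts, s[start:])
}

// unquote trims spaces and, for a quoted-string, strips the surrounding
// quotes and resolves backslash escapes. A value that opens a quote but
// never closes it is treated as malformed and yields "".
func unquote(v string) string {
	v = strings.TrimSpace(v)
	if !strings.HasPrefix(v, `"`) {
		return v
	}
	if len(v) < 2 || v[len(v)-1] != '"' {
		return ""
	}
	inner := v[1 : len(v)-1]
	var b strings.Builder
	for i := 0; i < len(inner); i++ {
		if inner[i] == '\\' && i+1 < len(inner) {
			i++
		}
		b.WriteByte(inner[i])
	}
	return b.String()
}

// sketchParseForwarded returns the proto and host directives of the first
// element of a Forwarded header value. Assumption: later elements (added by
// proxies further from the client) are ignored.
func sketchParseForwarded(header string) (proto, host string) {
	first := splitUnquoted(header, ',')[0]
	for _, pair := range splitUnquoted(first, ';') {
		name, value, ok := strings.Cut(pair, "=")
		if !ok {
			continue // malformed pair without "=": skip it
		}
		switch strings.ToLower(strings.TrimSpace(name)) {
		case "proto":
			proto = unquote(value)
		case "host":
			host = unquote(value)
		}
	}
	return proto, host
}

func main() {
	for _, h := range []string{
		`for=192.0.2.60;proto=https;host="example.com:8443"`,
		`proto=http;host=a.example, proto=https;host=b.example`,
		`garbage`,
	} {
		proto, host := sketchParseForwarded(h)
		fmt.Printf("%q -> proto=%q host=%q\n", h, proto, host)
	}
}
```

Splitting on commas and semicolons only outside quotes matters because a quoted host or for value may legally contain either character (for example a bracketed IPv6 address with a port).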
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -4974,13 +4974,6 @@ includeDarwin:
  - backend: "kitten-tts"
    tag-suffix: "-metal-darwin-arm64-kitten-tts"
    build-type: "mps"
-  # vLLM on Apple Silicon via vllm-metal (MLX). The install is custom
-  # (backend/python/vllm/install.sh has a darwin branch); lang stays python so
-  # backend_build_darwin.yml drives it through build-darwin-python-backend ->
-  # scripts/build/python-darwin.sh, which runs the backend's install.sh.
-  - backend: "vllm"
-    tag-suffix: "-metal-darwin-arm64-vllm"
-    build-type: "mps"
  - backend: "liquid-audio"
    tag-suffix: "-metal-darwin-arm64-liquid-audio"
    build-type: "mps"
--- a/.github/bump_vllm_metal.sh
+++ b/.github/bump_vllm_metal.sh
@@ -1,55 +0,0 @@
-#!/bin/bash
-# Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
-# darwin (Apple Silicon) install path. The macOS/Metal build
-# (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
-# version-locked to a specific vLLM source release. install.sh derives that vLLM
-# version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
-# tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
-# which bumps the Linux cu130 wheel pin.
-#
-# This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
-# darwin build can only use the exact vLLM version vllm-metal supports, so it may
-# lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
-set -xe
-REPO=$1   # vllm-project/vllm-metal
-FILE=$2   # backend/python/vllm/install.sh
-VAR=$3    # VLLM_METAL_VERSION (used for the workflow's output file names)
-
-if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
-    echo "usage: $0 <repo> <install-file> <var-name>" >&2
-    exit 1
-fi
-
-# vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
-# /releases/latest returns the newest one (with its cp312 wheel asset).
-LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
-    "https://api.github.com/repos/$REPO/releases/latest" \
-    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
-
-# The coupled vLLM source version lives in vllm-metal's installer at that tag.
-NEW_VLLM_VERSION=$(curl -fsSL \
-    "https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
-    | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
-
-if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
-    echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
-    exit 1
-fi
-
-set +e
-CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
-set -e
-
-# Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
-# time, so there is nothing else to touch. peter-evans/create-pull-request opens
-# no PR on a clean tree, so a no-op rewrite (already current) is safe.
-sed -i "$FILE" \
-    -e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
-
-if [ -z "$CURRENT_TAG" ]; then
-    echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
-    exit 0
-fi
-
-echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
-echo "${LATEST_TAG}" >> "${VAR}_commit.txt"
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -154,39 +154,3 @@ jobs:
          branch: "update/VLLM_VERSION"
          body: ${{ steps.bump.outputs.message }}
          signoff: true
-
-  bump-vllm-metal:
-    # The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
-    # to a specific vLLM source release. install.sh pins both VLLM_METAL_VERSION
-    # (the wheel release) and VLLM_VERSION (the vLLM it builds against); this job
-    # tracks vllm-project/vllm-metal and rewrites both atomically. Separate from
-    # bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-      - name: Bump vllm-metal pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_METAL_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_METAL_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
-          title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_METAL_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -645,7 +645,6 @@
    nvidia-cuda-13: "cuda13-vllm"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm"
    cpu: "cpu-vllm"
-    metal: "metal-vllm"
 - &sglang
  name: "sglang"
  license: apache-2.0
@@ -2930,17 +2929,6 @@
    nvidia-cuda-13: "cuda13-vllm-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-development"
    cpu: "cpu-vllm-development"
-    metal: "metal-vllm-development"
- !!merge <<: *vllm
-  name: "metal-vllm"
-  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vllm"
-  mirrors:
-    - localai/localai-backends:latest-metal-darwin-arm64-vllm
- !!merge <<: *vllm
-  name: "metal-vllm-development"
-  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vllm"
-  mirrors:
-    - localai/localai-backends:master-metal-darwin-arm64-vllm
 - !!merge <<: *vllm
  name: "cuda12-vllm"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
--- a/backend/python/vllm/backend.py
+++ b/backend/python/vllm/backend.py
@@ -457,14 +457,9 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
                    except Exception:
                        pass

-                _pl = getattr(last_output, "prompt_logprobs", None) if last_output is not None else None
-                # Some engines accept the prompt_logprobs request but return a
-                # list of all-None entries instead of computing them (observed
-                # with vllm-metal's MLX backend on macOS). Treat that as
-                # unsupported rather than silently scoring every candidate as 0.
-                if not _pl or all(e is None for e in _pl):
-                    context.set_code(grpc.StatusCode.UNIMPLEMENTED)
-                    context.set_details("This backend did not return prompt_logprobs; scoring is unsupported on this engine (e.g. vllm-metal / MLX on macOS).")
+                if last_output is None or not getattr(last_output, "prompt_logprobs", None):
+                    context.set_code(grpc.StatusCode.INTERNAL)
+                    context.set_details("vLLM did not return prompt_logprobs")
                    return backend_pb2.ScoreResponse()

                prompt_logprobs = last_output.prompt_logprobs
--- a/backend/python/vllm/install.sh
+++ b/backend/python/vllm/install.sh
@@ -43,24 +43,6 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi

-# Apple Silicon (Metal/MLX) via vllm-metal.
-# vllm-metal (github.com/vllm-project/vllm-metal) brings vLLM to macOS on Apple
-# Silicon: it registers through vLLM's platform-plugin entry point
-# (metal -> vllm_metal:register), MetalPlatform activates, and the vLLM v1
-# AsyncLLM engine runs on the GPU through MLX. LocalAI's backend.py is UNCHANGED
-# on darwin — AsyncEngineArgs(...) -> AsyncLLMEngine.from_engine_args transparently
-# resolves to the MLX engine (proven on a real M4 / macOS 26.5 against Qwen3-0.6B).
-#
-# vllm-metal REQUIRES Python 3.12, so force the portable CPython before the venv
-# is created (ensureVenv reads PYTHON_VERSION/PYTHON_PATCH/PY_STANDALONE_TAG).
-# The patch + standalone tag mirror the l4t13 cp312 pin — a known-good
-# python-build-standalone release that also ships an aarch64-apple-darwin asset.
-if [ "$(uname -s)" = "Darwin" ]; then
-    PYTHON_VERSION="3.12"
-    PYTHON_PATCH="12"
-    PY_STANDALONE_TAG="20251120"
-fi
-
 # JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
 # (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
 # an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
@@ -75,87 +57,11 @@ if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    PY_STANDALONE_TAG="20251120"
 fi

-# ===================== Apple Silicon (Metal/MLX) =====================
-# Reproduce vllm-metal's upstream installer
-# (curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh)
-# but INTO LocalAI's managed venv (ensureVenv) instead of a throwaway
-# ~/.venv-vllm-metal, so the backend integrates with LocalAI's venv lifecycle
-# (portable CPython, _makeVenvPortable relocation, runtime activation). The
-# normal CUDA/CPU installRequirements is skipped on darwin — there is no
-# macOS/arm64 vLLM wheel on PyPI; vLLM is built from source and the MLX engine
-# is layered on by the vllm-metal wheel.
-if [ "$(uname -s)" = "Darwin" ]; then
-    # Create/activate the portable 3.12 venv. On darwin USE_PIP=true and
-    # PORTABLE_PYTHON=true (set by scripts/build/python-darwin.sh), so this is a
-    # `python -m venv` based, relocatable venv.
-    ensureVenv
-
-    # vllm-metal's installer drives everything through `uv`: building vLLM from
-    # the CPU requirements needs `--index-strategy unsafe-best-match` (mixes the
-    # pytorch CPU channel with PyPI), a flag plain pip does not have. The darwin
-    # venv is pip-based, so bootstrap uv into it. uv honours $VIRTUAL_ENV (set by
-    # libbackend's _activateVenv) and installs into THIS venv — same pattern the
-    # intel branch below relies on.
-    pip install uv
-
-    # The ONLY darwin version pin -- AUTO-BUMPED by .github/bump_vllm_metal.sh,
-    # which tracks vllm-project/vllm-metal releases (NOT vllm/vllm latest). Keep
-    # it as a plain double-quoted assignment on its own line so the bumper's sed
-    # can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
-    # vllm pin (requirements-cublas13-after.txt, bumped independently against
-    # vllm/vllm) until vllm-metal supports a newer vLLM.
-    VLLM_METAL_VERSION="v0.3.0.dev20260622062346"
-
-    # The coupled vLLM source version is whatever this vllm-metal release builds
-    # against -- it declares it in its own installer as `vllm_v=`. Derive it from
-    # the PINNED tag rather than hardcoding a second value that could drift. The
-    # tag is immutable, so this stays reproducible across rebuilds.
-    VLLM_VERSION=$(curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm-metal/${VLLM_METAL_VERSION}/install.sh" \
-        | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -n1 | cut -d'"' -f2)
-    if [ -z "${VLLM_VERSION}" ]; then
-        echo "ERROR: could not derive the vLLM version from vllm-metal ${VLLM_METAL_VERSION}" >&2
-        exit 1
-    fi
-    echo "vllm-metal ${VLLM_METAL_VERSION} builds against vLLM ${VLLM_VERSION}"
-
-    _vllm_src=$(mktemp -d)
-    trap 'rm -rf "${_vllm_src}"' EXIT
-    pushd "${_vllm_src}"
-        # 1) Build vLLM ${VLLM_VERSION} from the release source tarball against
-        #    the CPU requirements. vllm-metal layers its MLX platform plugin on
-        #    top of this exact build.
-        curl -fsSL -o "vllm-${VLLM_VERSION}.tar.gz" \
-            "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}.tar.gz"
-        tar -xzf "vllm-${VLLM_VERSION}.tar.gz"
-        pushd "vllm-${VLLM_VERSION}"
-            uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
-            # -Wno-parentheses: clang on macOS treats one of vLLM's C++ warnings
-            # as an error without it (matches the upstream installer's CXXFLAGS).
-            CXXFLAGS="-Wno-parentheses" uv pip install .
-        popd
-    popd
-
-    # 2) Install the prebuilt vllm-metal wheel for the PINNED release. It pulls
-    #    mlx / mlx-metal as deps and registers the `metal` platform plugin that
-    #    backend.py resolves to at engine-init time. Build the release-asset URL
-    #    deterministically (tag + the cp312/arm64 wheel name) rather than querying
-    #    api.github.com, whose unauthenticated rate limit (60/hr per IP) 403s on
-    #    shared CI runners. The wheel version is the tag without its leading 'v'.
-    _metal_wheel="vllm_metal-${VLLM_METAL_VERSION#v}-cp312-cp312-macosx_11_0_arm64.whl"
-    _metal_wheel_url="https://github.com/vllm-project/vllm-metal/releases/download/${VLLM_METAL_VERSION}/${_metal_wheel}"
-    echo "Installing vllm-metal wheel: ${_metal_wheel_url}"
-    uv pip install "${_metal_wheel_url}"
-
-    # Generate the gRPC stubs (backend_pb2*). installRequirements normally does
-    # this via runProtogen at the end; we skipped installRequirements on darwin,
-    # so call it explicitly here.
-    runProtogen
-
 # Intel XPU has no upstream-published vllm wheels, so we always build vllm
 # from source against torch-xpu and replace the default triton with
 # triton-xpu (matching torch 2.11). Mirrors the upstream procedure:
 # https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/gpu.xpu.inc.md
-elif [ "x${BUILD_TYPE}" == "xintel" ]; then
+if [ "x${BUILD_TYPE}" == "xintel" ]; then
    # Hide requirements-intel-after.txt so installRequirements doesn't
    # try `pip install vllm` (would either fail or grab a non-XPU wheel).
    _intel_after="${backend_dir}/requirements-intel-after.txt"
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -4,7 +4,4 @@
 # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
 --extra-index-url https://wheels.vllm.ai/0.23.0/cu130
-# VERSION COUPLING: darwin/Apple-Silicon builds use vllm-metal (see install.sh),
-# which pins this exact vLLM version. Bumping vllm here means coordinating with a
-# vllm-metal release that supports the new version, or macOS/Metal builds break.
 vllm==0.23.0
--- a/core/cli/run.go
+++ b/core/cli/run.go
@@ -140,7 +140,7 @@ type RunCMD struct {
 	OIDCIssuer           string `env:"LOCALAI_OIDC_ISSUER" help:"OIDC issuer URL for auto-discovery" group:"auth"`
 	OIDCClientID         string `env:"LOCALAI_OIDC_CLIENT_ID" help:"OIDC Client ID (auto-enables auth)" group:"auth"`
 	OIDCClientSecret     string `env:"LOCALAI_OIDC_CLIENT_SECRET" help:"OIDC Client Secret" group:"auth"`
-	AuthBaseURL          string `env:"LOCALAI_BASE_URL" help:"Base URL for OAuth callbacks (e.g. http://localhost:8080)" group:"auth"`
+	ExternalBaseURL      string `env:"LOCALAI_BASE_URL" help:"External base URL of this instance (e.g. https://localhost:8080). Used for OAuth callbacks and self-referential links (generated images/videos, job status). When unset, derived from X-Forwarded-Proto/Host or Forwarded headers." group:"api"`
 	AuthAdminEmail       string `env:"LOCALAI_ADMIN_EMAIL" help:"Email address to auto-promote to admin role" group:"auth"`
 	AuthRegistrationMode string `env:"LOCALAI_REGISTRATION_MODE" default:"open" help:"Registration mode: 'open' (default), 'approval', or 'invite' (invite code required)" group:"auth"`
 	DisableLocalAuth     bool   `env:"LOCALAI_DISABLE_LOCAL_AUTH" default:"false" help:"Disable local email/password registration and login (use with OAuth/OIDC-only setups)" group:"auth"`
@@ -503,9 +503,6 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 			opts = append(opts, config.WithAuthOIDCClientID(r.OIDCClientID))
 			opts = append(opts, config.WithAuthOIDCClientSecret(r.OIDCClientSecret))
 		}
-		if r.AuthBaseURL != "" {
-			opts = append(opts, config.WithAuthBaseURL(r.AuthBaseURL))
-		}
 		if r.AuthAdminEmail != "" {
 			opts = append(opts, config.WithAuthAdminEmail(r.AuthAdminEmail))
 		}
@@ -523,6 +520,12 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 		}
 	}

+	// Applied unconditionally: the external base URL governs all self-referential
+	// links (not just OAuth callbacks), so it must take effect even when auth is off.
+	if r.ExternalBaseURL != "" {
+		opts = append(opts, config.WithExternalBaseURL(r.ExternalBaseURL))
+	}
+
 	if idleWatchDog || busyWatchDog {
 		opts = append(opts, config.EnableWatchDog)
 		if idleWatchDog {
--- a/core/config/application_config.go
+++ b/core/config/application_config.go
@@ -49,6 +49,13 @@ type ApplicationConfig struct {
 	P2PNetworkID                  string
 	Federated                     bool

+	// ExternalBaseURL is the externally visible base URL of this instance
+	// (scheme+host[:port]), set via LOCALAI_BASE_URL. When non-empty it is
+	// authoritative for every self-referential URL LocalAI emits (OAuth
+	// callbacks, generated image/video links, async job StatusURLs),
+	// overriding proxy-header detection. Empty = derive from request headers.
+	ExternalBaseURL string
+
 	// DisableStats turns off per-request token tracking. By default the
 	// routing module's billing recorder runs in every mode (including
 	// no-auth single-user) so dashboards and `/api/usage` are immediately
@@ -196,7 +203,6 @@ type AuthConfig struct {
 	OIDCIssuer          string // OIDC issuer URL for auto-discovery (e.g. https://accounts.google.com)
 	OIDCClientID        string
 	OIDCClientSecret    string
-	BaseURL             string // for OAuth callback URLs (e.g. "http://localhost:8080")
 	AdminEmail          string // auto-promote to admin on login
 	RegistrationMode    string // "open", "approval" (default when empty), "invite"
 	DisableLocalAuth    bool   // disable local email/password registration and login
@@ -950,9 +956,9 @@ func WithAuthGitHubClientSecret(clientSecret string) AppOption {
 	}
 }

-func WithAuthBaseURL(baseURL string) AppOption {
+func WithExternalBaseURL(url string) AppOption {
 	return func(o *ApplicationConfig) {
-		o.Auth.BaseURL = baseURL
+		o.ExternalBaseURL = url
 	}
 }

--- a/core/config/hardware_defaults.go
+++ b/core/config/hardware_defaults.go
@@ -54,35 +54,8 @@ func (g GPU) IsNVIDIABlackwell() bool {
 	return maj >= 12
 }

-// Compute-buffer headroom guard for the raised physical batch.
-//
-// Raising n_ubatch grows the CUDA *compute buffer* (the scratch for the forward
-// graph), which is allocated PER DEVICE — it does not benefit from a second GPU
-// the way weights or KV (which are split across devices) do. The buffer scales
-// ~linearly with n_ubatch * n_ctx, so a large context turns the GB10-tuned
-// ub2048 into multi-GiB of extra scratch that must fit on a SINGLE card. On a
-// 16 GiB consumer Blackwell with a 200k context that overflows (issue #10485),
-// even though the GB10 it was measured on (128 GiB unified memory) had room.
-//
-// These constants size a conservative guard: only raise the batch when the
-// extra scratch fits the per-device VRAM ceiling.
-const (
-	// computeBufferBytesPerCell approximates the CUDA compute-buffer cost of one
-	// (n_ubatch * n_ctx) cell. Derived from an observed allocation (ub2048 *
-	// ctx204800 ~= 4.5 GiB => ~11 B/cell) and rounded up to 16 for margin, since
-	// the real cost also grows with model width (heads / embedding dim) which we
-	// don't know at config time.
-	computeBufferBytesPerCell = 16
-	// blackwellBatchHeadroomDivisor caps the extra compute buffer from raising the
-	// physical batch at VRAM/divisor. /4 keeps the bulk of a device for weights +
-	// KV, which already dominate VRAM use.
-	blackwellBatchHeadroomDivisor = 4
-)
-
 // PhysicalBatch returns the canonical physical batch (n_batch/n_ubatch) for the
-// given hardware class, ignoring context/VRAM headroom. Use
-// PhysicalBatchForContext when a model context and per-device VRAM are known
-// (the load paths) so the raised batch can't overflow a single device.
+// given hardware, used when the model config leaves batch unset.
 func PhysicalBatch(g GPU) int {
 	if g.IsNVIDIABlackwell() {
 		return BlackwellPhysicalBatch
@@ -90,32 +63,6 @@ func PhysicalBatch(g GPU) int {
 	return DefaultPhysicalBatch
 }

-// PhysicalBatchForContext is PhysicalBatch gated on per-device VRAM headroom for
-// the given context: it only raises the batch above the conservative default
-// when the extra compute buffer (which is allocated on a single device and grows
-// with n_ubatch * n_ctx) fits within blackwellBatchHeadroomDivisor of the GPU's
-// VRAM. g.VRAM must be the PER-DEVICE ceiling (the smallest device on a
-// multi-GPU host), not the summed total — the compute buffer can't be split.
-//
-// VRAM 0 (unknown) stays conservative rather than risk a per-device OOM; the
-// GB10 / unified-memory path reports system RAM, so it still clears the guard.
-func PhysicalBatchForContext(g GPU, ctx int) int {
-	if !g.IsNVIDIABlackwell() {
-		return DefaultPhysicalBatch
-	}
-	if ctx <= 0 {
-		ctx = DefaultContextSize
-	}
-	if g.VRAM == 0 {
-		return DefaultPhysicalBatch
-	}
-	extra := uint64(ctx) * uint64(BlackwellPhysicalBatch-DefaultPhysicalBatch) * computeBufferBytesPerCell
-	if extra <= g.VRAM/blackwellBatchHeadroomDivisor {
-		return BlackwellPhysicalBatch
-	}
-	return DefaultPhysicalBatch
-}
-
 // IsManagedPhysicalBatch reports whether n is a value PhysicalBatch assigns.
 // Callers that re-tune a value chosen by an upstream host (the distributed
 // router correcting the frontend's guess) use this to avoid clobbering an
@@ -175,12 +122,7 @@ func hasParallelOption(opts []string) bool {
 // deterministic device — detection does a live nvidia-smi call.
 var localGPU = func() GPU {
 	vendor, _ := xsysinfo.DetectGPUVendor()
-	// Use the SMALLEST device's VRAM, not the summed total: the parallel-slot
-	// tier and the batch headroom guard both reason about what fits on a single
-	// card, and per-device compute buffers can't be split across GPUs. Summing
-	// two 16 GiB cards into "32 GiB" is what over-provisioned multi-GPU hosts
-	// into OOM (issue #10485).
-	vram, _ := xsysinfo.MinPerGPUVRAM()
+	vram, _ := xsysinfo.TotalAvailableVRAM()
 	return GPU{
 		Vendor:            vendor,
 		ComputeCapability: xsysinfo.NVIDIAComputeCapability(),
@@ -195,20 +137,10 @@ func ApplyHardwareDefaults(cfg *ModelConfig, gpu GPU) {
 	if cfg == nil {
 		return
 	}
-	// Raise the physical batch on Blackwell only when the resulting compute
-	// buffer fits the per-device VRAM at THIS model's context. Leaving Batch at 0
-	// (rather than writing the default 512) preserves the downstream single-pass
-	// sizing in core/backend.EffectiveBatchSize for embedding/score/rerank.
-	if cfg.Batch == 0 {
-		ctx := DefaultContextSize
-		if cfg.ContextSize != nil {
-			ctx = *cfg.ContextSize
-		}
-		if PhysicalBatchForContext(gpu, ctx) == BlackwellPhysicalBatch {
-			cfg.Batch = BlackwellPhysicalBatch
-			xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
-				"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability, "context", ctx, "vram_gib", gpu.VRAM>>30)
-		}
+	if cfg.Batch == 0 && gpu.IsNVIDIABlackwell() {
+		cfg.Batch = BlackwellPhysicalBatch
+		xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
+			"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability)
 	}

 	// Enable concurrent serving by default on a capable GPU: without this the
--- a/core/config/hardware_defaults_internal_test.go
+++ b/core/config/hardware_defaults_internal_test.go
@@ -9,37 +9,26 @@ import (
 // GPU. The detection seam (localGPU) is injected so the path is deterministic
 // without a real GPU.
 var _ = Describe("SetDefaults hardware defaults (single-instance)", func() {
-	const gib = uint64(1) << 30
-
 	var orig func() GPU
 	BeforeEach(func() { orig = localGPU })
 	AfterEach(func() { localGPU = orig })

-	It("sets the physical batch on a local Blackwell GPU with headroom", func() {
-		localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
+	It("sets the physical batch on a local Blackwell GPU", func() {
+		localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
 		cfg := &ModelConfig{}
 		cfg.SetDefaults()
 		Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
 	})

-	It("leaves batch unset when a large context would overflow the device", func() {
-		// Regression guard for issue #10485: 16 GiB consumer Blackwell + ~200k ctx.
-		localGPU = func() GPU { return GPU{ComputeCapability: "12.0", VRAM: 16 * gib} }
-		ctx := 204800
-		cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
-		cfg.SetDefaults()
-		Expect(cfg.Batch).To(Equal(0))
-	})
-
 	It("leaves batch unset on a non-Blackwell local GPU", func() {
-		localGPU = func() GPU { return GPU{ComputeCapability: "8.9", VRAM: 119 * gib} }
+		localGPU = func() GPU { return GPU{ComputeCapability: "8.9"} }
 		cfg := &ModelConfig{}
 		cfg.SetDefaults()
 		Expect(cfg.Batch).To(Equal(0))
 	})

 	It("never overrides an explicit batch", func() {
-		localGPU = func() GPU { return GPU{ComputeCapability: "12.1", VRAM: 119 * gib} }
+		localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
 		cfg := &ModelConfig{}
 		cfg.Batch = 1024
 		cfg.SetDefaults()
--- a/core/config/hardware_defaults_test.go
+++ b/core/config/hardware_defaults_test.go
@@ -7,8 +7,6 @@ import (
 )

 var _ = Describe("Hardware-driven config defaults", func() {
-	const gib = uint64(1) << 30
-
 	DescribeTable("GPU.IsNVIDIABlackwell (sm_12x consumer family)",
 		func(cc string, want bool) {
 			Expect(GPU{ComputeCapability: cc}.IsNVIDIABlackwell()).To(Equal(want))
@@ -37,54 +35,21 @@ var _ = Describe("Hardware-driven config defaults", func() {
 		})
 	})

-	Describe("PhysicalBatchForContext (per-device VRAM headroom)", func() {
-		It("raises the batch when the compute buffer fits the device", func() {
-			// 16 GiB Blackwell with a small context: the extra scratch is tiny.
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 8192)).
-				To(Equal(BlackwellPhysicalBatch))
-		})
-		It("keeps the default batch when a large context would overflow one device", func() {
-			// The issue #10485 case: 16 GiB consumer Blackwell, ~200k context.
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.0", VRAM: 16 * gib}, 204800)).
-				To(Equal(DefaultPhysicalBatch))
-		})
-		It("still raises the batch on a large unified-memory device (GB10)", func() {
-			// GB10 reports system RAM (~119 GiB) as its single device's VRAM.
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1", VRAM: 119 * gib}, 204800)).
-				To(Equal(BlackwellPhysicalBatch))
-		})
-		It("stays conservative when VRAM is unknown", func() {
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "12.1"}, 8192)).
-				To(Equal(DefaultPhysicalBatch))
-		})
-		It("never raises the batch on non-Blackwell", func() {
-			Expect(PhysicalBatchForContext(GPU{ComputeCapability: "9.0", VRAM: 80 * gib}, 8192)).
-				To(Equal(DefaultPhysicalBatch))
-		})
-	})
-
 	Describe("ApplyHardwareDefaults", func() {
-		It("raises an unset batch to 2048 on Blackwell with headroom", func() {
+		It("raises an unset batch to 2048 on Blackwell", func() {
 			cfg := &ModelConfig{}
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
 			Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
 		})
-		It("leaves batch unset when a large context would overflow one device", func() {
-			// Regression guard for issue #10485: 16 GiB card + ~200k context.
-			ctx := 204800
-			cfg := &ModelConfig{LLMConfig: LLMConfig{ContextSize: &ctx}}
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.0", VRAM: 16 * gib})
-			Expect(cfg.Batch).To(Equal(0))
-		})
 		It("leaves batch unset on non-Blackwell", func() {
 			cfg := &ModelConfig{}
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0", VRAM: 119 * gib})
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0"})
 			Expect(cfg.Batch).To(Equal(0))
 		})
 		It("never overrides an explicit batch", func() {
 			cfg := &ModelConfig{}
 			cfg.Batch = 1024
-			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
 			Expect(cfg.Batch).To(Equal(1024))
 		})
 		It("no-ops on nil", func() {
@@ -92,6 +57,8 @@ var _ = Describe("Hardware-driven config defaults", func() {
 		})
 	})

+	const gib = uint64(1) << 30
+
 	DescribeTable("DefaultParallelSlots (by VRAM)",
 		func(vramGiB uint64, want int) {
 			Expect(DefaultParallelSlots(GPU{VRAM: vramGiB * gib})).To(Equal(want))
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -1204,6 +1204,11 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
 	// This ensures gallery-installed and runtime-loaded models get optimal parameters.
 	ApplyInferenceDefaults(cfg, cfg.Name, cfg.Model)

+	// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell).
+	// Uses the local GPU here; in distributed mode the router re-applies the same
+	// heuristics for the selected node's GPU before loading. Explicit config wins.
+	ApplyHardwareDefaults(cfg, localGPU())
+
 	// Apply serving-policy defaults (device-independent): cross-request prefix
 	// caching. Propagates to distributed nodes via the model options.
 	ApplyServingDefaults(cfg)
@@ -1242,16 +1247,6 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
 		cfg.ContextSize = &ctx
 	}
 	runBackendHooks(cfg, lo.modelPath)
-
-	// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell)
-	// LAST, after the context size is fully resolved (explicit config, LoadOptions,
-	// then the GGUF guess inside runBackendHooks): the Blackwell batch guard sizes
-	// the per-device compute buffer against this model's context, so it must see
-	// the final value, not a pre-guess nil. Uses the local GPU here; in distributed
-	// mode the router re-applies the same heuristics for the selected node's GPU
-	// before loading. Explicit config always wins.
-	ApplyHardwareDefaults(cfg, localGPU())
-
 	cfg.syncKnownUsecasesFromString()
 }

--- a/core/http/app.go
+++ b/core/http/app.go
@@ -149,6 +149,18 @@ func API(application *application.Application) (*echo.Echo, error) {
 	// Middleware - StripPathPrefix must be registered early as it uses Rewrite which runs before routing
 	e.Pre(httpMiddleware.StripPathPrefix())

+	// Stamp the configured external base URL into each request context so
+	// middleware.BaseURL can treat it as authoritative for self-referential
+	// links. Registered as Pre so it runs before routing and handlers.
+	if extBaseURL := application.ApplicationConfig().ExternalBaseURL; extBaseURL != "" {
+		e.Pre(func(next echo.HandlerFunc) echo.HandlerFunc {
+			return func(c echo.Context) error {
+				c.Set("_external_base_url", extBaseURL)
+				return next(c)
+			}
+		})
+	}
+
 	e.Pre(middleware.RemoveTrailingSlash())

 	if application.ApplicationConfig().MachineTag != "" {
--- a/core/http/middleware/baseurl.go
+++ b/core/http/middleware/baseurl.go
@@ -55,17 +55,70 @@ func BasePathPrefix(c echo.Context) string {
 // The returned URL is guaranteed to end with `/`.
 // The method should be used in conjunction with the StripPathPrefix middleware.
 func BaseURL(c echo.Context) string {
+	// An explicit external base URL (LOCALAI_BASE_URL) is authoritative for
+	// the origin. The proxy-derived path prefix is still appended so a
+	// reverse-proxy mount point keeps working. Trailing slashes are
+	// normalized via BasePathPrefix, which always starts and ends with "/".
+	if ext, ok := c.Get("_external_base_url").(string); ok && ext != "" {
+		return strings.TrimRight(ext, "/") + BasePathPrefix(c)
+	}
+
+	fwdProto, fwdHost := parseForwarded(c.Request().Header.Get("Forwarded"))
+
 	scheme := "http"
-	if c.Request().Header.Get("X-Forwarded-Proto") == "https" {
+	switch {
+	case c.Request().TLS != nil:
 		scheme = "https"
-	} else if c.Request().TLS != nil {
+	case strings.EqualFold(firstToken(c.Request().Header.Get("X-Forwarded-Proto")), "https"):
+		scheme = "https"
+	case strings.EqualFold(fwdProto, "https"):
 		scheme = "https"
 	}

 	host := c.Request().Host
 	if forwardedHost := c.Request().Header.Get("X-Forwarded-Host"); forwardedHost != "" {
 		host = forwardedHost
+	} else if fwdHost != "" {
+		host = fwdHost
 	}

 	return scheme + "://" + host + BasePathPrefix(c)
 }
+
+// firstToken returns the first comma-separated token of v, trimmed of spaces.
+// Reverse-proxy chains can emit X-Forwarded-Proto as "https,http"; only the
+// first hop (closest to the client) is meaningful for scheme detection.
+func firstToken(v string) string {
+	if i := strings.IndexByte(v, ','); i >= 0 {
+		v = v[:i]
+	}
+	return strings.TrimSpace(v)
+}
+
+// parseForwarded extracts the proto and host directives from the first element
+// of an RFC 7239 Forwarded header (e.g. `for=x;proto=https;host=h, for=y`).
+// Values may be quoted. Returns empty strings when absent or malformed so the
+// caller can fall through to other signals.
+func parseForwarded(header string) (proto, host string) {
+	if header == "" {
+		return "", ""
+	}
+	// Only the first element (closest proxy to the client) matters here.
+	if i := strings.IndexByte(header, ','); i >= 0 {
+		header = header[:i]
+	}
+	for _, directive := range strings.Split(header, ";") {
+		key, value, ok := strings.Cut(strings.TrimSpace(directive), "=")
+		if !ok {
+			continue
+		}
+		value = strings.Trim(strings.TrimSpace(value), `"`)
+		switch strings.ToLower(strings.TrimSpace(key)) {
+		case "proto":
+			proto = value
+		case "host":
+			host = value
+		}
+	}
+	return proto, host
+}
--- a/core/http/middleware/baseurl_test.go
+++ b/core/http/middleware/baseurl_test.go
@@ -135,4 +135,138 @@ var _ = Describe("BaseURL", func() {
 			Entry("missing leading slash", "evil"),
 		)
 	})
+
+	Context("scheme detection hardening", func() {
+		It("treats comma-separated X-Forwarded-Proto as https when first token is https", func() {
+			app := echo.New()
+			actualURL := ""
+			app.GET("/x", func(c echo.Context) error {
+				actualURL = BaseURL(c)
+				return nil
+			})
+			req := httptest.NewRequest("GET", "/x", nil)
+			req.Header.Set("X-Forwarded-Proto", "https,http")
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+			Expect(actualURL).To(Equal("https://example.com/"))
+		})
+
+		It("derives https from the RFC 7239 Forwarded proto directive", func() {
+			app := echo.New()
+			actualURL := ""
+			app.GET("/x", func(c echo.Context) error {
+				actualURL = BaseURL(c)
+				return nil
+			})
+			req := httptest.NewRequest("GET", "/x", nil)
+			req.Header.Set("Forwarded", "for=192.0.2.1;proto=https;host=proxy.example")
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+			Expect(actualURL).To(Equal("https://proxy.example/"))
+		})
+
+		It("prefers X-Forwarded-Host over the Forwarded host directive", func() {
+			app := echo.New()
+			actualURL := ""
+			app.GET("/x", func(c echo.Context) error {
+				actualURL = BaseURL(c)
+				return nil
+			})
+			req := httptest.NewRequest("GET", "/x", nil)
+			req.Header.Set("X-Forwarded-Host", "xfh.example")
+			req.Header.Set("Forwarded", "host=fwd.example;proto=https")
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+			Expect(actualURL).To(Equal("https://xfh.example/"))
+		})
+	})
+
+	Context("explicit external base URL override", func() {
+		It("uses the configured origin over conflicting forwarded headers", func() {
+			app := echo.New()
+			actualURL := ""
+			app.GET("/x", func(c echo.Context) error {
+				c.Set("_external_base_url", "https://192.168.0.13:34567")
+				actualURL = BaseURL(c)
+				return nil
+			})
+			req := httptest.NewRequest("GET", "/x", nil)
+			req.Header.Set("X-Forwarded-Proto", "http")
+			req.Header.Set("X-Forwarded-Host", "internal:8080")
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+			Expect(actualURL).To(Equal("https://192.168.0.13:34567/"))
+		})
+
+		It("combines the configured origin with a detected path prefix", func() {
+			app := echo.New()
+			actualURL := ""
+			app.GET("/hello", func(c echo.Context) error {
+				c.Set("_original_path", "/localai/hello")
+				c.Set("_external_base_url", "https://ext.example")
+				actualURL = BaseURL(c)
+				return nil
+			})
+			req := httptest.NewRequest("GET", "/hello", nil)
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+			Expect(actualURL).To(Equal("https://ext.example/localai/"))
+		})
+
+		It("ignores an empty override", func() {
+			app := echo.New()
+			actualURL := ""
+			app.GET("/x", func(c echo.Context) error {
+				c.Set("_external_base_url", "")
+				actualURL = BaseURL(c)
+				return nil
+			})
+			req := httptest.NewRequest("GET", "/x", nil)
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+			Expect(actualURL).To(Equal("http://example.com/"))
+		})
+	})
+
+	Context("parseForwarded helper", func() {
+		It("parses unquoted proto and host", func() {
+			proto, host := parseForwarded("for=192.0.2.1;proto=https;host=h.example")
+			Expect(proto).To(Equal("https"))
+			Expect(host).To(Equal("h.example"))
+		})
+
+		It("strips quotes around values", func() {
+			proto, host := parseForwarded(`proto="https";host="h.example"`)
+			Expect(proto).To(Equal("https"))
+			Expect(host).To(Equal("h.example"))
+		})
+
+		It("uses only the first element of a multi-element header", func() {
+			proto, host := parseForwarded("proto=https;host=first.example, proto=http;host=second.example")
+			Expect(proto).To(Equal("https"))
+			Expect(host).To(Equal("first.example"))
+		})
+
+		It("returns empty strings for an empty header", func() {
+			proto, host := parseForwarded("")
+			Expect(proto).To(BeEmpty())
+			Expect(host).To(BeEmpty())
+		})
+
+		It("skips directives without a value", func() {
+			proto, host := parseForwarded("proto;host=h.example")
+			Expect(proto).To(BeEmpty())
+			Expect(host).To(Equal("h.example"))
+		})
+	})
+
+	Context("firstToken helper", func() {
+		It("returns the whole trimmed string when there is no comma", func() {
+			Expect(firstToken("  https  ")).To(Equal("https"))
+		})
+
+		It("returns the first trimmed token when there is a comma", func() {
+			Expect(firstToken("https , http")).To(Equal("https"))
+		})
+	})
 })
--- a/core/http/routes/auth.go
+++ b/core/http/routes/auth.go
@@ -268,7 +268,7 @@ func RegisterAuthRoutes(e *echo.Echo, app *application.Application) {
 	// Set up OAuth manager when any OAuth/OIDC provider is configured
 	if appConfig.Auth.GitHubClientID != "" || appConfig.Auth.OIDCClientID != "" {
 		oauthMgr, err := auth.NewOAuthManager(
-			appConfig.Auth.BaseURL,
+			appConfig.ExternalBaseURL,
 			auth.OAuthParams{
 				GitHubClientID:     appConfig.Auth.GitHubClientID,
 				GitHubClientSecret: appConfig.Auth.GitHubClientSecret,
--- a/core/services/nodes/router.go
+++ b/core/services/nodes/router.go
@@ -156,10 +156,7 @@ func applyNodeHardwareDefaults(opts *pb.ModelOptions, node *BackendNode) {
 		VRAM:              node.TotalVRAM,
 	}
 	if config.IsManagedPhysicalBatch(int(opts.NBatch)) {
-		// Gate the raised batch on the selected node's per-device VRAM at this
-		// model's context, so a large context can't overflow the node's compute
-		// buffer (issue #10485). node.TotalVRAM is the node's reported ceiling.
-		opts.NBatch = int32(config.PhysicalBatchForContext(gpu, int(opts.ContextSize)))
+		opts.NBatch = int32(config.PhysicalBatch(gpu))
 	}
 	// Default concurrent serving for the selected node (the frontend that built
 	// the options may have no GPU). Only adds when no parallel option is set.
--- a/core/services/nodes/router_hardware_internal_test.go
+++ b/core/services/nodes/router_hardware_internal_test.go
@@ -8,19 +8,12 @@ import (
 )

 var _ = Describe("applyNodeHardwareDefaults", func() {
-	It("raises a managed default batch on a Blackwell node with headroom", func() {
-		opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 8192}
-		applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1", TotalVRAM: 119 << 30})
+	It("raises a managed default batch on a Blackwell node", func() {
+		opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch}
+		applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.1"})
 		Expect(opts.NBatch).To(BeEquivalentTo(config.BlackwellPhysicalBatch))
 	})

-	It("keeps the default batch when a large context would overflow the node", func() {
-		// Regression guard for issue #10485 on the distributed path.
-		opts := &pb.ModelOptions{NBatch: config.DefaultPhysicalBatch, ContextSize: 204800}
-		applyNodeHardwareDefaults(opts, &BackendNode{GPUComputeCapability: "12.0", TotalVRAM: 16 << 30})
-		Expect(opts.NBatch).To(BeEquivalentTo(config.DefaultPhysicalBatch))
-	})
-
 	It("resets a Blackwell guess on a non-Blackwell node", func() {
 		// frontend (Blackwell) guessed high, but the selected node is not Blackwell
 		opts := &pb.ModelOptions{NBatch: config.BlackwellPhysicalBatch}
--- a/docs/content/advanced/reverse-proxy-tls.md
+++ b/docs/content/advanced/reverse-proxy-tls.md
@@ -14,6 +14,26 @@ When running LocalAI behind a TLS termination reverse proxy, the Web UI may fail

 LocalAI uses the `X-Forwarded-Proto` HTTP header to determine the protocol used by clients. When this header is set to `https`, LocalAI will generate HTTPS URLs for static assets in the Web UI.

+## Running behind a reverse proxy (HTTPS / subpath)
+
+LocalAI does not terminate TLS itself, so HTTPS is provided by a reverse
+proxy in front of it. Self-referential links (generated image and video
+URLs, async job status URLs, OAuth callbacks) need the externally visible
+scheme, host and port.
+
+LocalAI determines these in this order:
+
+1. `LOCALAI_BASE_URL` - if set, it is authoritative for the origin. Set it to
+   the externally visible base URL, e.g. `LOCALAI_BASE_URL=https://localai.example.com`
+   or `https://192.168.0.13:34567`. Recommended whenever links come back with
+   the wrong scheme or host.
+2. Otherwise, the `X-Forwarded-Proto` and `X-Forwarded-Host` headers (or the
+   RFC 7239 `Forwarded` header) sent by the proxy. Ensure your proxy forwards
+   `X-Forwarded-Proto: https`.
+
+A reverse-proxy subpath mount is supported via `X-Forwarded-Prefix`; it is
+appended to `LOCALAI_BASE_URL` when both are present.
+
 ## Required Headers

 Your reverse proxy must forward these headers to LocalAI:
--- a/pkg/xsysinfo/gpu.go
+++ b/pkg/xsysinfo/gpu.go
@@ -129,61 +129,6 @@ func TotalAvailableVRAM() (uint64, error) {
 	return 0, nil
 }

-// MinPerGPUVRAM returns the total VRAM of the SMALLEST GPU on the host (in
-// bytes), or 0 when no per-device VRAM is known. Unlike TotalAvailableVRAM
-// (which sums across devices) this reports a single device's ceiling, which is
-// the right figure for decisions about what must fit on one card: the compute
-// buffer (sized by n_ubatch) and the parallel-slot tier. Summing a multi-GPU
-// host's VRAM over-provisions those into a per-device OOM (issue #10485).
-//
-// Unified-memory devices (GB10, Apple) report system RAM as their single
-// device's VRAM, so they are unaffected.
-func MinPerGPUVRAM() (uint64, error) {
-	// Prefer per-device binary detection (nvidia-smi/rocm-smi report true
-	// per-card VRAM); ghw's per-card memory can reflect NUMA node RAM on some
-	// hosts, which is why TotalAvailableVRAM treats it as a sum.
-	if infos := GetGPUMemoryUsage(); len(infos) > 0 {
-		if v := minNonZeroVRAM(infos); v > 0 {
-			return v, nil
-		}
-	}
-
-	// Fallback: ghw per-card memory, taking the minimum non-zero card.
-	if gpus, err := GPUs(); err == nil {
-		var min uint64
-		for _, gpu := range gpus {
-			if gpu == nil || gpu.Node == nil || gpu.Node.Memory == nil {
-				continue
-			}
-			if b := gpu.Node.Memory.TotalUsableBytes; b > 0 {
-				if u := uint64(b); min == 0 || u < min {
-					min = u
-				}
-			}
-		}
-		if min > 0 {
-			return min, nil
-		}
-	}
-
-	return 0, nil
-}
-
-// minNonZeroVRAM returns the smallest non-zero TotalVRAM across the given GPUs,
-// or 0 when none report VRAM.
-func minNonZeroVRAM(infos []GPUMemoryInfo) uint64 {
-	var min uint64
-	for _, g := range infos {
-		if g.TotalVRAM == 0 {
-			continue
-		}
-		if min == 0 || g.TotalVRAM < min {
-			min = g.TotalVRAM
-		}
-	}
-	return min
-}
-
 func HasGPU(vendor string) bool {
 	gpus, err := GPUs()
 	if err != nil {
--- a/pkg/xsysinfo/minvram_internal_test.go
+++ b/pkg/xsysinfo/minvram_internal_test.go
@@ -1,37 +0,0 @@
-package xsysinfo
-
-import (
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("minNonZeroVRAM", func() {
-	const gib = uint64(1) << 30
-
-	It("returns the smallest device on a multi-GPU host", func() {
-		// Two unequal cards (e.g. RTX 5070 Ti + 5060 Ti, both 16 GiB, or a
-		// mixed pair): the smallest device is the per-card allocation ceiling.
-		infos := []GPUMemoryInfo{
-			{TotalVRAM: 16 * gib},
-			{TotalVRAM: 12 * gib},
-		}
-		Expect(minNonZeroVRAM(infos)).To(Equal(12 * gib))
-	})
-
-	It("ignores devices that report zero VRAM", func() {
-		infos := []GPUMemoryInfo{
-			{TotalVRAM: 0},
-			{TotalVRAM: 24 * gib},
-		}
-		Expect(minNonZeroVRAM(infos)).To(Equal(24 * gib))
-	})
-
-	It("returns the single device's VRAM on a one-GPU host", func() {
-		Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 16 * gib}})).To(Equal(16 * gib))
-	})
-
-	It("returns 0 when no device reports VRAM", func() {
-		Expect(minNonZeroVRAM([]GPUMemoryInfo{{TotalVRAM: 0}})).To(BeZero())
-		Expect(minNonZeroVRAM(nil)).To(BeZero())
-	})
-})
Author	SHA1	Message	Date
Ettore Di Giacinto	46b76cb4ac	test(http): cover parseForwarded edge cases; clarify base-url flag group Adds direct unit coverage for quoted/malformed/multi-element Forwarded headers and regroups the external base URL flag away from auth-only. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 22:14:25 +00:00
Ettore Di Giacinto	15c7ce059a	docs: document LOCALAI_BASE_URL and reverse-proxy headers Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 22:08:58 +00:00
Ettore Di Giacinto	975b54dfc5	feat(config): generalize LOCALAI_BASE_URL to ExternalBaseURL LOCALAI_BASE_URL now sets a single instance-wide external base URL used for OAuth callbacks and all self-referential links. A Pre middleware stamps it into the request context for middleware.BaseURL. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 22:03:56 +00:00
Ettore Di Giacinto	2eec8bfeb9	feat(http): honor explicit external base URL in BaseURL When _external_base_url is set in the request context it dictates the origin (scheme+host+port); the proxy path prefix is still appended. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:58:21 +00:00
Ettore Di Giacinto	d9feac54dc	fix(http): harden BaseURL proxy scheme/host detection Split comma-separated X-Forwarded-Proto and honor the RFC 7239 Forwarded header so generated links use https behind common reverse-proxy setups. Refs #10482 Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:57:43 +00:00