Files
LocalAI/core/config/defaults.go
LocalAI [bot] 1154be5eea fix(config): fall back to DefaultContextSize for unparseable GGUFs; pin NVFP4 gallery context_size (#10563)
The GGUF metadata parser (gpustack/gguf-parser-go) cannot read NVFP4-quantized
GGUFs at all: it errors with "read tensor info 0: This quantized type is
currently unsupported" because NVFP4 is a ggml tensor type it does not know.
When ParseGGUFFile errors, the llama-cpp defaults hook skips guessGGUFFromFile
entirely and the deferred fallback sets the context window to the conservative
GGUFFallbackContextSize (1024). The result: a model that trains to 262144
tokens runs with n_ctx=1024, and every prompt over ~1k tokens fails with
"request (N tokens) exceeds the available context size (1024 tokens)".

Two changes:

- Drop GGUFFallbackContextSize (1024) and fall back to DefaultContextSize
  (4096) in both the GGUF run-estimate path (gguf.go) and the deferred hook
  fallback (hooks_llamacpp.go). 1024 is a sensible floor for a tiny CPU GGUF
  but a footgun for a large, long-context model whose header simply cannot be
  parsed. Strengthen the existing "GGUF unreadable" test to assert the value.

- Set context_size explicitly on the four NVFP4 gallery entries
  (qwen3.6-35b-a3b-nvfp4-mtp, qwopus3.6-27b-v2-mtp-nvfp4,
  qwopus3.6-27b-coder-mtp-nvfp4, qwen3.6-27b-nvfp4-mtp) so the parser failure
  is irrelevant for them. 32768 matches sibling Qwen entries and is safe on
  memory; operators can raise it toward the 262144 train length.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 23:34:52 +02:00

29 lines
1.3 KiB
Go

package config
// Canonical default values.
//
// These are owned here so the two layers that need them share a single source
// of truth: the config tiers (ApplyInference/Hardware/Serving/Generic — which
// *decide* defaults) and core/backend/options.go (which *translates* a
// ModelConfig to the backend wire format and supplies the same fallbacks
// defensively). Previously these were duplicated as literals across both
// packages and had drifted (e.g. n_gpu_layers 9999999 vs 99999999, two batch
// constants of 512). core/backend imports core/config, so backend references
// these; config never imports backend.
const (
// DefaultContextSize is the fallback context window when none is configured
// or estimable from the model. It is also the fallback for a GGUF whose
// metadata yields no usable estimate or that the parser cannot read at all
// (e.g. a quant type it does not know, such as NVFP4): a model-agnostic
// safe default beats a tiny, surprising window that truncates real prompts.
DefaultContextSize = 4096
// DefaultNGPULayers means "offload all layers"; the backend (fit_params)
// clamps to what actually fits in device memory.
DefaultNGPULayers = 99999999
// DefaultFlashAttention is the flash-attention mode default; "auto" lets the
// backend enable it when the model + backend support it.
DefaultFlashAttention = "auto"
)