mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 18:06:58 -04:00
The GGUF metadata parser (gpustack/gguf-parser-go) cannot read NVFP4-quantized GGUFs at all: it errors with "read tensor info 0: This quantized type is currently unsupported" because NVFP4 is a ggml tensor type it does not know. When ParseGGUFFile errors, the llama-cpp defaults hook skips guessGGUFFromFile entirely and the deferred fallback sets the context window to the conservative GGUFFallbackContextSize (1024). The result: a model that trains to 262144 tokens runs with n_ctx=1024, and every prompt over ~1k tokens fails with "request (N tokens) exceeds the available context size (1024 tokens)".

Two changes:

- Drop GGUFFallbackContextSize (1024) and fall back to DefaultContextSize (4096) in both the GGUF run-estimate path (gguf.go) and the deferred hook fallback (hooks_llamacpp.go). 1024 is a sensible floor for a tiny CPU GGUF but a footgun for a large, long-context model whose header simply cannot be parsed. Strengthen the existing "GGUF unreadable" test to assert the value.
- Set context_size explicitly on the four NVFP4 gallery entries (qwen3.6-35b-a3b-nvfp4-mtp, qwopus3.6-27b-v2-mtp-nvfp4, qwopus3.6-27b-coder-mtp-nvfp4, qwen3.6-27b-nvfp4-mtp) so the parser failure is irrelevant for them. 32768 matches sibling Qwen entries and is safe on memory; operators can raise it toward the 262144 train length.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
29 lines
1.3 KiB
Go
package config

// Canonical default values.
//
// These are owned here so the two layers that need them share a single source
// of truth: the config tiers (ApplyInference/Hardware/Serving/Generic — which
// *decide* defaults) and core/backend/options.go (which *translates* a
// ModelConfig to the backend wire format and supplies the same fallbacks
// defensively). Previously these were duplicated as literals across both
// packages and had drifted (e.g. n_gpu_layers 9999999 vs 99999999, two batch
// constants of 512). core/backend imports core/config, so backend references
// these; config never imports backend.
const (
	// DefaultContextSize is the fallback context window when none is configured
	// or estimable from the model. It is also the fallback for a GGUF whose
	// metadata yields no usable estimate or that the parser cannot read at all
	// (e.g. a quant type it does not know, such as NVFP4): a model-agnostic
	// safe default beats a tiny, surprising window that truncates real prompts.
	DefaultContextSize = 4096

	// DefaultNGPULayers means "offload all layers"; the backend (fit_params)
	// clamps to what actually fits in device memory.
	DefaultNGPULayers = 99999999

	// DefaultFlashAttention is the flash-attention mode default; "auto" lets the
	// backend enable it when the model + backend support it.
	DefaultFlashAttention = "auto"
)