The GGUF metadata parser (gpustack/gguf-parser-go) cannot read NVFP4-quantized
GGUFs at all: it errors with "read tensor info 0: This quantized type is
currently unsupported" because NVFP4 is a ggml tensor type it does not know.
When ParseGGUFFile errors, the llama-cpp defaults hook skips guessGGUFFromFile
entirely and the deferred fallback sets the context window to the conservative
GGUFFallbackContextSize (1024). The result: a model trained with a
262144-token context runs with n_ctx=1024, and every prompt over ~1k tokens
fails with
"request (N tokens) exceeds the available context size (1024 tokens)".
Two changes:
- Drop GGUFFallbackContextSize (1024) and fall back to DefaultContextSize
(4096) in both the GGUF run-estimate path (gguf.go) and the deferred hook
fallback (hooks_llamacpp.go). 1024 is a sensible floor for a tiny CPU GGUF
but a footgun for a large, long-context model whose header simply cannot be
parsed. Strengthen the existing "GGUF unreadable" test to assert the value.
- Set context_size explicitly on the four NVFP4 gallery entries
(qwen3.6-35b-a3b-nvfp4-mtp, qwopus3.6-27b-v2-mtp-nvfp4,
qwopus3.6-27b-coder-mtp-nvfp4, qwen3.6-27b-nvfp4-mtp) so the parser failure
is irrelevant for them. 32768 matches sibling Qwen entries and is safe on
memory; operators can raise it toward the 262144-token training length.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>