feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge, 2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option delegates to upstream's `common_speculative_types_from_names()`, which already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed by MTP is auto-derived inside `common_context_params_to_llama` from `params.speculative.need_n_rs_seq()`, and when no `draft_model` is set the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration guide with the new type, both load paths (MTP head embedded in the main GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by `convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and, when present and the user has not configured a `spec_type` explicitly, auto-append the upstream-recommended speculative-decoding tuple:

- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current default with a "change to 0.0f" TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps.

Detection runs in two places:

- The model importer (`POST /models/import-uri`, the `/import-model` UI) range-fetches the GGUF header for HuggingFace / direct-URL imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and non-fatal error handling. OCI/Ollama URIs are skipped because the artifact is not directly streamable; the load-time hook covers them once the file is on disk.
- The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local header on every model start and appends the same options if `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set `spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo specs cover the append, preserve-user-choice, legacy alias, and nil safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was handing it the raw `huggingface://...` URI directly (and similarly for any other custom downloader scheme). Live-testing against `huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf` exposed this: the probe failed with `unsupported protocol scheme "huggingface"`, was caught by the non-fatal error path, and the MTP options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and require the resolved form to be HTTP(S). After the fix the probe successfully reads `<arch>.nextn_predict_layers=1` from the real HF GGUF and the emitted ConfigFile carries spec_type:draft-mtp, spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
- `reasoning_format` - Reasoning format options
- Any new flags or parameters

### Speculative Decoding Types

The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.

`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this - it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
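
As a quick smoke-test illustration (the model name and GGUF filename below are placeholders, not files in this repo), any value from that list can be exercised end-to-end with nothing more than a model YAML entry; the string after `spec_type:` is handed to upstream's parser unchanged:

```yaml
# hypothetical smoke-test config; swap in a real GGUF path
name: spec-type-smoke-test
backend: llama-cpp
parameters:
  model: any-model.gguf
options:
- spec_type:ngram-mod   # any name accepted by common_speculative_types_from_names()
```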

### Implementation Guidelines

1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation

@@ -1,5 +1,5 @@
-LLAMA_VERSION?=1348f67c58f561808136e8a152a9eddec168f221
+LLAMA_VERSION?=0253fb21f595246f54c192fe8332f34173be251b
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

CMAKE_ARGS?=

@@ -54,6 +54,13 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
        cfg.modelTemplate = chatTemplate.ValueString()
    }

    // Auto-enable Multi-Token Prediction (ggml-org/llama.cpp#22673) when the
    // GGUF carries an embedded MTP head. Skipped silently for non-MTP models
    // and when the user already configured a spec_type.
    if n, ok := HasEmbeddedMTPHead(f); ok {
        ApplyMTPDefaults(cfg, n)
    }

    // Thinking support detection is done after model load via DetectThinkingSupportFromBackend

    // template estimations

core/config/mtp.go (new file, 84 lines added)
@@ -0,0 +1,84 @@
package config

import (
    "strings"

    gguf "github.com/gpustack/gguf-parser-go"
    "github.com/mudler/xlog"
)

// mtpSpecOptions lists the speculative-decoding option keys auto-applied when
// an MTP head is detected on a llama-cpp GGUF. Defaults track the upstream
// MTP PR (ggml-org/llama.cpp#22673):
//
//   - spec_type:draft-mtp activates Multi-Token Prediction
//   - spec_n_max:6 draft window
//   - spec_p_min:0.75 pinned because upstream marked the 0.75 default
//     with a "change to 0.0f" TODO; locking it here keeps acceptance
//     thresholds stable across future bumps
var mtpSpecOptions = []string{
    "spec_type:draft-mtp",
    "spec_n_max:6",
    "spec_p_min:0.75",
}

// MTPSpecOptions returns a copy of the option keys auto-applied when an MTP
// head is detected. Exported for testing and for the importer.
func MTPSpecOptions() []string {
    out := make([]string, len(mtpSpecOptions))
    copy(out, mtpSpecOptions)
    return out
}

// HasEmbeddedMTPHead reports whether the parsed GGUF declares a Multi-Token
// Prediction head. Detection reads `<arch>.nextn_predict_layers`, which is
// what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
// `conversion/qwen.py` MTP mixin. A positive layer count means the head is
// present in the same GGUF as the trunk.
func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
    if f == nil {
        return 0, false
    }
    arch := f.Architecture().Architecture
    if arch == "" {
        return 0, false
    }
    v, ok := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
    if !ok {
        return 0, false
    }
    n := gguf.ValueNumeric[uint32](v)
    return n, n > 0
}

// hasSpecTypeOption returns true when the slice already contains a
// user-configured `spec_type:` / `speculative_type:` entry. Used to avoid
// clobbering an explicit choice with the MTP auto-defaults.
func hasSpecTypeOption(opts []string) bool {
    for _, o := range opts {
        if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
            return true
        }
    }
    return false
}

// ApplyMTPDefaults appends the auto-MTP option keys to cfg.Options when none
// is already configured. It is a no-op when the user already picked a
// `spec_type` (either via YAML or via the importer's preferences flow).
//
// `layers` is the value read from `<arch>.nextn_predict_layers` and is only
// used for the diagnostic log line.
func ApplyMTPDefaults(cfg *ModelConfig, layers uint32) {
    if cfg == nil {
        return
    }
    if hasSpecTypeOption(cfg.Options) {
        xlog.Debug("[mtp] embedded MTP head detected but spec_type already configured; leaving user choice intact",
            "name", cfg.Name, "nextn_layers", layers)
        return
    }
    cfg.Options = append(cfg.Options, mtpSpecOptions...)
    xlog.Info("[mtp] embedded MTP head detected; enabling draft-mtp speculative decoding",
        "name", cfg.Name, "nextn_layers", layers, "spec_n_max", 6, "spec_p_min", 0.75)
}

core/config/mtp_test.go (new file, 86 lines added)
@@ -0,0 +1,86 @@
package config_test

import (
    . "github.com/mudler/LocalAI/core/config"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
)

var _ = Describe("MTP auto-defaults", func() {
    Context("MTPSpecOptions", func() {
        It("returns the upstream-recommended speculative tuple", func() {
            Expect(MTPSpecOptions()).To(Equal([]string{
                "spec_type:draft-mtp",
                "spec_n_max:6",
                "spec_p_min:0.75",
            }))
        })

        It("returns a defensive copy so callers cannot mutate the package default", func() {
            opts := MTPSpecOptions()
            opts[0] = "spec_type:none"
            Expect(MTPSpecOptions()[0]).To(Equal("spec_type:draft-mtp"))
        })
    })

    Context("ApplyMTPDefaults", func() {
        It("appends MTP options when nothing is configured", func() {
            cfg := &ModelConfig{Name: "qwen-mtp"}
            ApplyMTPDefaults(cfg, 1)
            Expect(cfg.Options).To(Equal([]string{
                "spec_type:draft-mtp",
                "spec_n_max:6",
                "spec_p_min:0.75",
            }))
        })

        It("preserves unrelated options already on the config", func() {
            cfg := &ModelConfig{
                Name:    "qwen-mtp",
                Options: []string{"use_jinja:true", "cache_reuse:256"},
            }
            ApplyMTPDefaults(cfg, 1)
            Expect(cfg.Options).To(Equal([]string{
                "use_jinja:true",
                "cache_reuse:256",
                "spec_type:draft-mtp",
                "spec_n_max:6",
                "spec_p_min:0.75",
            }))
        })

        It("is a no-op when the user already configured spec_type", func() {
            cfg := &ModelConfig{
                Name:    "qwen-mtp",
                Options: []string{"spec_type:ngram-simple", "use_jinja:true"},
            }
            ApplyMTPDefaults(cfg, 1)
            Expect(cfg.Options).To(Equal([]string{
                "spec_type:ngram-simple",
                "use_jinja:true",
            }))
        })

        It("also respects the legacy speculative_type alias", func() {
            cfg := &ModelConfig{
                Name:    "qwen-mtp",
                Options: []string{"speculative_type:ngram-mod"},
            }
            ApplyMTPDefaults(cfg, 1)
            Expect(cfg.Options).To(Equal([]string{"speculative_type:ngram-mod"}))
        })

        It("tolerates a nil config", func() {
            Expect(func() { ApplyMTPDefaults(nil, 1) }).ToNot(Panic())
        })
    })

    Context("HasEmbeddedMTPHead", func() {
        It("returns false on a nil GGUF file", func() {
            n, ok := HasEmbeddedMTPHead(nil)
            Expect(ok).To(BeFalse())
            Expect(n).To(BeZero())
        })
    })
})

@@ -1,10 +1,13 @@
package importers

import (
    "context"
    "encoding/json"
    "path/filepath"
    "strings"
    "time"

    gguf "github.com/gpustack/gguf-parser-go"
    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/gallery"
    "github.com/mudler/LocalAI/core/schema"

@@ -261,6 +264,13 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
    // Apply per-model-family inference parameter defaults
    config.ApplyInferenceDefaults(&modelConfig, details.URI)

    // Auto-detect Multi-Token Prediction heads (ggml-org/llama.cpp#22673) and
    // enable speculative decoding. Mirrors the load-time hook so freshly
    // imported configs already carry spec_type:draft-mtp before the model is
    // ever loaded - users see it in the YAML preview rather than discovering
    // it after the first start.
    maybeApplyMTPDefaults(&modelConfig, details, &cfg)

    data, err := yaml.Marshal(modelConfig)
    if err != nil {
        return gallery.ModelConfig{}, err

@@ -291,6 +301,85 @@ func pickPreferredGroup(groups []hfapi.ShardGroup, prefs []string) *hfapi.ShardG
    return &groups[len(groups)-1]
}

// maybeApplyMTPDefaults parses the picked GGUF header (range-fetched over
// HTTP for HF/URL imports) and, if the file declares a Multi-Token Prediction
// head, appends the auto-MTP option keys to modelConfig.Options. Failures
// during the probe are non-fatal: the importer keeps the config without MTP
// so an unrelated network blip or weird header doesn't break the import.
//
// OCI/Ollama URIs are skipped because the artifact isn't directly fetchable
// as a GGUF byte stream - the load-time hook (core/config/gguf.go) covers
// those once the model is materialised on disk.
func maybeApplyMTPDefaults(modelConfig *config.ModelConfig, details Details, cfg *gallery.ModelConfig) {
    probeURL := pickMTPProbeURL(details, cfg)
    if probeURL == "" {
        return
    }

    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    defer func() {
        if r := recover(); r != nil {
            xlog.Debug("[mtp-importer] panic while probing GGUF header", "uri", probeURL, "recover", r)
        }
    }()

    f, err := gguf.ParseGGUFFileRemote(ctx, probeURL)
    if err != nil {
        xlog.Debug("[mtp-importer] failed to read remote GGUF header for MTP detection", "uri", probeURL, "error", err)
        return
    }

    n, ok := config.HasEmbeddedMTPHead(f)
    if !ok {
        return
    }
    config.ApplyMTPDefaults(modelConfig, n)
}

// pickMTPProbeURL returns an HTTP(S) URL pointing at the main (non-mmproj)
// GGUF shard that should be inspected for an MTP head, or "" when no
// suitable URL is available. Custom URI schemes (`huggingface://`,
// `ollama://`, etc.) are run through `downloader.URI.ResolveURL` so the
// resulting URL is something `gguf.ParseGGUFFileRemote` can actually open.
// OCI/Ollama URIs are skipped because the artifact is not directly
// streamable as a GGUF byte range.
func pickMTPProbeURL(details Details, cfg *gallery.ModelConfig) string {
    uri := downloader.URI(details.URI)

    if uri.LooksLikeOCI() {
        return ""
    }

    if strings.HasSuffix(strings.ToLower(details.URI), ".gguf") {
        return resolveHTTPProbe(details.URI)
    }

    for _, f := range cfg.Files {
        lower := strings.ToLower(f.Filename)
        if strings.Contains(lower, "mmproj") {
            continue
        }
        if !strings.HasSuffix(lower, ".gguf") {
            continue
        }
        return resolveHTTPProbe(f.URI)
    }
    return ""
}

// resolveHTTPProbe resolves an importer-side URI to the HTTP(S) URL that
// `gguf.ParseGGUFFileRemote` can range-fetch. Returns "" if the URI can't
// be reduced to an HTTP(S) endpoint (e.g. local path, unsupported scheme).
func resolveHTTPProbe(uri string) string {
    resolved := downloader.URI(uri).ResolveURL()
    if downloader.URI(resolved).LooksLikeHTTPURL() {
        return resolved
    }
    return ""
}

// appendShardGroup copies every shard of group into cfg.Files under dest,
// skipping any entry whose target filename is already present so repeated
// calls (e.g. the rare case of mmproj + model picking the same group)

@@ -323,6 +323,7 @@ The canonical names match upstream llama.cpp (dash-separated). For backward comp
| `none` | | No speculative decoding (default) |
| `draft-simple` | `draft`, `draft_simple` | Draft model-based speculation (auto-set when `draft_model` is configured) |
| `draft-eagle3` | `eagle3`, `draft_eagle3` | EAGLE3 draft model architecture |
| `draft-mtp` | `draft_mtp` | Multi-Token Prediction. Reuses the target model's embedded MTP head; no separate draft GGUF required (`draft_model` can be omitted). |
| `ngram-simple` | `ngram_simple` | Simple self-speculative using token history |
| `ngram-map-k` | `ngram_map_k` | N-gram with key-only map |
| `ngram-map-k4v` | `ngram_map_k4v` | N-gram with keys and 4 m-gram values |
@@ -335,6 +336,71 @@ Multiple types can be chained by passing a comma-separated list to `spec_type` (
Speculative decoding is automatically disabled when multimodal models (with `mmproj`) are active. The `n_draft` parameter can also be overridden per-request.
{{% /notice %}}

##### Multi-Token Prediction (MTP)

`draft-mtp` enables [Multi-Token Prediction](https://github.com/ggml-org/llama.cpp/pull/22673) (ggml-org/llama.cpp#22673). MTP uses a small prediction head trained into the target model: the head runs alongside the main forward pass and proposes the next few tokens, which the target then verifies in a single batched step. Upstream reports ~1.85x-2.1x token throughput at ~72-82% draft acceptance on Qwen3.6 27B / 35B A3B.

**Auto-detection (default).** When a GGUF declares an MTP head (the upstream `<arch>.nextn_predict_layers` metadata key, set by `convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar), LocalAI auto-enables MTP with the following defaults:

```yaml
options:
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
```

Detection runs both at **import time** (the `/import-model` UI / `POST /models/import-uri` flow range-fetches the GGUF header and writes the options into the generated YAML before you save it) and at **load time** (every llama-cpp model start re-checks the local header and appends the options if `spec_type` isn't already set). To opt out, set an explicit `spec_type:` / `speculative_type:` in your YAML - auto-detection always preserves the user value, including `spec_type:none`.
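
For example, a minimal opt-out sketch (the model name and GGUF filename are illustrative) that keeps auto-detection from turning MTP on:

```yaml
name: qwen3-no-mtp            # illustrative name
backend: llama-cpp
parameters:
  model: qwen3-27b-with-mtp.gguf
options:
- spec_type:none              # an explicit value always wins over the auto-applied MTP defaults
```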

**Two ways to load the MTP head:**

1. **Embedded in the target GGUF** (the recommended path for LocalAI, and what auto-detection assumes). When `spec_type` includes `draft-mtp` and `draft_model` is empty, the backend builds the MTP draft context directly from the target model's weights. The GGUF must have been converted with the MTP tensors included.
2. **Separate `mtp-*.gguf` sibling file.** If you point `draft_model` at the separate MTP-head GGUF that ships next to the main weights on HuggingFace, the backend will load it as a draft model. Note: upstream's `-hf` auto-discovery of `mtp-*.gguf` siblings is **not** wired into LocalAI's gRPC layer - you need to download the sibling file and configure `draft_model` explicitly.

**Manual override knobs** (overlap with the auto-detect defaults above):

| Option | Recommended | Notes |
|--------|------------|-------|
| `spec_type` | `draft-mtp` | Activates MTP. Can be chained with other types (see below). |
| `spec_n_max` / `draft_max` | `2`-`6` | Number of draft tokens per step. Upstream's PR suggests 2-3 for the tightest acceptance window; LocalAI's auto-default is 6 to favour throughput on models with high acceptance. |
| `spec_p_min` | `0.75` | Pinned because upstream marks the current default with a "change to 0.0f" TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps. |
| `mmproj_use_gpu` | `false` (or unset `mmproj`) | MTP has a prompt-processing overhead; if the model is non-vision, drop the mmproj entirely to save VRAM. |

**Minimal config** (override-only, since auto-detection already covers this for MTP-capable GGUFs):

```yaml
name: qwen3-mtp
backend: llama-cpp
parameters:
  model: qwen3-27b-with-mtp.gguf
options:
- spec_type:draft-mtp
- spec_n_max:3
```

**With a separate MTP head file:**

```yaml
name: qwen3-mtp
backend: llama-cpp
parameters:
  model: qwen3-27b.gguf
draft_model: qwen3-27b-mtp-head.gguf
options:
- spec_type:draft-mtp
- spec_n_max:3
```

**Chaining MTP with n-gram fallback** (experimental, from the PR's usage notes - useful when MTP acceptance drops on highly repetitive output):

```yaml
options:
- spec_type:draft-mtp,ngram-mod
- spec_n_max:3
- spec_ngram_mod_n_match:24
```

Pre-converted GGUFs with MTP heads are published on the [ggml-org HuggingFace org](https://huggingface.co/ggml-org) (initially Qwen3.6 27B and Qwen3.6 35B A3B).

### Reasoning Models (DeepSeek-R1, Qwen3, etc.)

These load-time options control how the backend parses `<think>` reasoning blocks and how much budget the model is allowed for thinking. They are set per model via the `options:` array.