LocalAI/.agents/llama-cpp-backend.md
LocalAI [bot] d77a9137d8 feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge,
2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option
delegates to upstream's `common_speculative_types_from_names()`, which
already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed
by MTP is auto-derived inside `common_context_params_to_llama` from
`params.speculative.need_n_rs_seq()`, and when no `draft_model` is set
the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration
guide with the new type, both load paths (MTP head embedded in the main
GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended
`spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also
notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is
not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are
picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:

  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.

Detection runs in two places:

  - The model importer (`POST /models/import-uri`, the `/import-model`
    UI) range-fetches the GGUF header for HuggingFace / direct-URL
    imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
    non-fatal error handling. OCI/Ollama URIs are skipped because the
    artifact is not directly streamable; the load-time hook covers them
    once the file is on disk.
  - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
    header on every model start and appends the same options if
    `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was
handing it the raw `huggingface://...` URI directly (and similarly for
any other custom downloader scheme). Live-test against
`huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf`
exposed this: the probe failed with `unsupported protocol scheme
"huggingface"`, was caught by the non-fatal error path, and the MTP
options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and
require the resolved form to be HTTP(S). After the fix the probe
successfully reads `<arch>.nextn_predict_layers=1` from the real HF
GGUF and the emitted ConfigFile carries spec_type:draft-mtp,
spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 22:42:48 +02:00


llama.cpp Backend

The llama.cpp backend (backend/cpp/llama-cpp/grpc-server.cpp) is a gRPC adaptation of the upstream HTTP server (llama.cpp/tools/server/server.cpp). It uses the same underlying server infrastructure from llama.cpp/tools/server/server-context.cpp.

Building and Testing

  • Test llama.cpp backend compilation: make backends/llama-cpp
  • The backend is built as part of the main build process
  • Check backend/cpp/llama-cpp/Makefile for build configuration

Architecture

  • grpc-server.cpp: gRPC server implementation, adapts HTTP server patterns to gRPC
  • Uses shared server infrastructure: server-context.cpp, server-task.cpp, server-queue.cpp, server-common.cpp
  • The gRPC server mirrors the HTTP server's functionality, exposing it through gRPC service methods rather than HTTP routes

Common Issues When Updating llama.cpp

When fixing compilation errors after upstream changes:

  1. Check how server.cpp (HTTP server) handles the same change
  2. Look for new public APIs or getter methods
  3. Store copies of needed data instead of accessing private members
  4. Update function calls to match new signatures
  5. Test with make backends/llama-cpp

Key Differences from HTTP Server

  • gRPC uses BackendServiceImpl class with gRPC service methods
  • HTTP server uses server_routes with HTTP handlers
  • Both use the same server_context and task queue infrastructure
  • gRPC methods: LoadModel, Predict, PredictStream, Embedding, Rerank, TokenizeString, GetMetrics, Health

Tool Call Parsing Maintenance

When working on JSON/XML tool call parsing functionality, always check llama.cpp for the reference implementation and any updates:

Checking for XML Parsing Changes

  1. Review XML Format Definitions: Check llama.cpp/common/chat-parser-xml-toolcall.h for xml_tool_call_format struct changes
  2. Review Parsing Logic: Check llama.cpp/common/chat-parser-xml-toolcall.cpp for parsing algorithm updates
  3. Review Format Presets: Check llama.cpp/common/chat-parser.cpp for new XML format presets (search for xml_tool_call_format form)
  4. Review Model Lists: Check llama.cpp/common/chat.h for COMMON_CHAT_FORMAT_* enum values that use XML parsing:
    • COMMON_CHAT_FORMAT_GLM_4_5
    • COMMON_CHAT_FORMAT_MINIMAX_M2
    • COMMON_CHAT_FORMAT_KIMI_K2
    • COMMON_CHAT_FORMAT_QWEN3_CODER_XML
    • COMMON_CHAT_FORMAT_APRIEL_1_5
    • COMMON_CHAT_FORMAT_XIAOMI_MIMO
    • Any new formats added
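One way to spot "any new formats added" after a llama.cpp bump is to compare the COMMON_CHAT_FORMAT_* names in the old and new copies of llama.cpp/common/chat.h. The following is a minimal sketch, assuming both header versions are available as text; added_chat_formats and chat_formats are hypothetical helper names, and the result only lists new enum names. Whether a new format actually uses XML parsing still has to be checked by hand.

```python
import re

# Matches enum-style identifiers such as COMMON_CHAT_FORMAT_GLM_4_5.
_FORMAT_RE = re.compile(r"\bCOMMON_CHAT_FORMAT_[A-Z0-9_]+\b")


def chat_formats(header_text: str) -> set[str]:
    """Collect every COMMON_CHAT_FORMAT_* identifier found in the header text."""
    return set(_FORMAT_RE.findall(header_text))


def added_chat_formats(old_header: str, new_header: str) -> list[str]:
    """Return identifiers present in the new header but not in the old one."""
    return sorted(chat_formats(new_header) - chat_formats(old_header))


if __name__ == "__main__":
    # Tiny inline stand-ins for two versions of chat.h (illustrative only;
    # COMMON_CHAT_FORMAT_EXAMPLE_NEW is a made-up name).
    old = "COMMON_CHAT_FORMAT_GLM_4_5,\nCOMMON_CHAT_FORMAT_KIMI_K2,\n"
    new = old + "COMMON_CHAT_FORMAT_EXAMPLE_NEW,\n"
    print(added_chat_formats(old, new))
```

In practice the two inputs would come from the header at the previous and the new LLAMA_VERSION SHA.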

Model Configuration Options

Always check llama.cpp for new model configuration options that should be supported in LocalAI:

  1. Check Server Context: Review llama.cpp/tools/server/server-context.cpp for new parameters
  2. Check Chat Params: Review llama.cpp/common/chat.h for common_chat_params struct changes
  3. Check Server Options: Review llama.cpp/tools/server/server.cpp for command-line argument changes
  4. Examples of options to check:
    • ctx_shift - Context shifting support
    • parallel_tool_calls - Parallel tool calling
    • reasoning_format - Reasoning format options
    • Any new flags or parameters

Speculative Decoding Types

The spec_type option in grpc-server.cpp delegates to upstream's common_speculative_types_from_names(). New speculative types added to the common_speculative_type_from_name map in common/speculative.cpp are therefore picked up automatically with no code changes; the only follow-up is a docs entry in docs/content/advanced/model-configuration.md. Current values: none, draft-simple, draft-eagle3, draft-mtp, ngram-simple, ngram-map-k, ngram-map-k4v, ngram-mod, ngram-cache.
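To illustrate how a comma-separated spec_type value such as draft-mtp,ngram-mod breaks down into the names listed above, here is a minimal Python sketch. It is not upstream's implementation: parse_spec_types and KNOWN_SPEC_TYPES are hypothetical names, the list is copied from this guide and will go stale as upstream adds types, and rejecting unknown names with an error is an assumption of the sketch.

```python
# Hypothetical helper: the names are copied from this guide, not read from llama.cpp.
KNOWN_SPEC_TYPES = (
    "none", "draft-simple", "draft-eagle3", "draft-mtp",
    "ngram-simple", "ngram-map-k", "ngram-map-k4v", "ngram-mod", "ngram-cache",
)


def parse_spec_types(value: str) -> list[str]:
    """Split a comma-separated spec_type value and check each name."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in KNOWN_SPEC_TYPES]
    if unknown:
        # Assumption: fail loudly; upstream's handling of unknown names may differ.
        raise ValueError(f"unknown speculative type(s): {', '.join(unknown)}")
    return names


if __name__ == "__main__":
    print(parse_spec_types("draft-mtp,ngram-mod"))
```

A check like this could serve as a quick lint for model YAML files, but the upstream map remains the source of truth.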

draft-mtp (Multi-Token Prediction, ggml-org/llama.cpp#22673) does not need a separate draft GGUF: when spec_type includes draft-mtp and draftmodel is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this, because it works through the existing params.speculative.types plumbing and the derived cparams.n_rs_seq = params.speculative.need_n_rs_seq() in common_context_params_to_llama.
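The commit notes for this change describe an auto-default step (ApplyMTPDefaults) that appends the MTP options when the GGUF header carries <arch>.nextn_predict_layers and the user has not set spec_type or its legacy alias speculative_type. The decision can be sketched in Python as below. This is an illustration, not LocalAI's Go code: the function names, the plain-dict metadata shape, and the "key:value" option strings are assumptions, and treating a non-positive layer count as "no MTP head" is a choice made by this sketch (the notes only say "when present").

```python
# Option tuple quoted from the commit notes.
MTP_DEFAULTS = ["spec_type:draft-mtp", "spec_n_max:6", "spec_p_min:0.75"]
# Keys that count as an explicit user choice (the second is the legacy alias).
SPEC_TYPE_KEYS = ("spec_type", "speculative_type")


def has_mtp_head(metadata: dict) -> bool:
    """True if the metadata advertises at least one next-token-prediction layer."""
    arch = metadata.get("general.architecture")
    if not arch:
        return False
    layers = metadata.get(f"{arch}.nextn_predict_layers")
    # Assumption: a missing or non-positive count means no MTP head.
    return isinstance(layers, int) and layers > 0


def apply_mtp_defaults(options, metadata) -> list[str]:
    """Return options plus the MTP defaults, unless the user already chose a type."""
    result = list(options or [])  # tolerate None and never mutate the input
    if not has_mtp_head(metadata or {}):
        return result
    for option in result:
        key = option.split(":", 1)[0].strip()
        if key in SPEC_TYPE_KEYS:
            return result  # explicit YAML setting wins
    return result + MTP_DEFAULTS


if __name__ == "__main__":
    # "examplearch" is a placeholder architecture name.
    meta = {"general.architecture": "examplearch",
            "examplearch.nextn_predict_layers": 1}
    print(apply_mtp_defaults([], meta))
```

The four branches mirror the cases the notes say are covered by tests: append, preserve the user's choice, honor the legacy alias, and stay safe on nil input.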

Implementation Guidelines

  1. Feature Parity: Always aim for feature parity with llama.cpp's implementation
  2. Test Coverage: Add tests for new features matching llama.cpp's behavior
  3. Documentation: Update relevant documentation when adding new formats or options
  4. Backward Compatibility: Ensure changes don't break existing functionality

Files to Monitor

  • llama.cpp/common/chat-parser-xml-toolcall.h - Format definitions
  • llama.cpp/common/chat-parser-xml-toolcall.cpp - Parsing logic
  • llama.cpp/common/chat-parser.cpp - Format presets and model-specific handlers
  • llama.cpp/common/chat.h - Format enums and parameter structures
  • llama.cpp/tools/server/server-context.cpp - Server configuration options
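
When bumping LLAMA_VERSION, it can help to check which of the files above changed between the old and new SHA. The sketch below assumes a local llama.cpp checkout and the git CLI on PATH; MONITORED, monitored_changes, and changed_between are hypothetical names, and the paths are written relative to the llama.cpp repository root.

```python
import subprocess

# The files listed above, relative to the llama.cpp repository root.
MONITORED = (
    "common/chat-parser-xml-toolcall.h",
    "common/chat-parser-xml-toolcall.cpp",
    "common/chat-parser.cpp",
    "common/chat.h",
    "tools/server/server-context.cpp",
)


def monitored_changes(changed_paths, monitored=MONITORED) -> list[str]:
    """Return the monitored files that appear in the list of changed paths."""
    changed = set(changed_paths)
    return [path for path in monitored if path in changed]


def changed_between(repo_dir: str, old_sha: str, new_sha: str) -> list[str]:
    """List the paths that differ between two commits, via `git diff --name-only`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-only", old_sha, new_sha],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


if __name__ == "__main__":
    # Pure filtering step shown with a hand-written list of changed paths.
    print(monitored_changes(["common/chat.h", "README.md"]))
```

Calling monitored_changes(changed_between(repo_dir, old_sha, new_sha)) combines the two steps; any file it reports is a candidate for the reviews described in the sections above.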