mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-17 21:21:23 -04:00
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

  Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge, 2026-05-16) to pick up Multi-Token Prediction support.

  No grpc-server.cpp changes are required: the existing `spec_type` option delegates to upstream's `common_speculative_types_from_names()`, which already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed by MTP is auto-derived inside `common_context_params_to_llama` from `params.speculative.need_n_rs_seq()`, and when no `draft_model` is set the upstream server builds the MTP context off the target model itself.

  Docs: extend the speculative-decoding section of the model-configuration guide with the new type, both load paths (MTP head embedded in the main GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended `spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is not wired through LocalAI's gRPC layer.

  Agent guide: short note explaining that new upstream spec types are picked up automatically and that MTP needs no gRPC plumbing.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

  Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by `convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and, when present and the user has not configured a `spec_type` explicitly, auto-append the upstream-recommended speculative-decoding tuple:

  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

  The 0.75 p_min is pinned defensively because upstream marks the current default with a "change to 0.0f" TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps.

  Detection runs in two places:

  - The model importer (`POST /models/import-uri`, the `/import-model` UI) range-fetches the GGUF header for HuggingFace / direct-URL imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and non-fatal error handling. OCI/Ollama URIs are skipped because the artifact is not directly streamable; the load-time hook covers them once the file is on disk.
  - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local header on every model start and appends the same options if `spec_type` is not already set.

  Both paths share `ApplyMTPDefaults` and respect an explicit user-set `spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo specs cover the append, preserve-user-choice, legacy alias, and nil safety paths.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

  `gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was handing it the raw `huggingface://...` URI directly (and similarly for any other custom downloader scheme). Live-test against `huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf` exposed this: the probe failed with `unsupported protocol scheme "huggingface"`, was caught by the non-fatal error path, and the MTP options were silently never applied to the generated YAML.

  Route every candidate URI through `downloader.URI.ResolveURL()` and require the resolved form to be HTTP(S). After the fix the probe successfully reads `<arch>.nextn_predict_layers=1` from the real HF GGUF and the emitted ConfigFile carries spec_type:draft-mtp, spec_n_max:6, spec_p_min:0.75 as intended.

  Assisted-by: Claude:claude-opus-4-7 [Claude Code]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

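
The auto-detection rule described in the second commit above can be sketched in a few lines. This is a Python illustration only, not the actual `ApplyMTPDefaults` code: the function names, argument shapes, and the `arch` metadata prefix used in the usage note are assumptions, and only the option strings and the "explicit user choice wins" rule come from the commit notes.

```python
from typing import Iterable, List, Mapping, Optional

# Defaults quoted in the commit notes above.
MTP_DEFAULTS = ["spec_type:draft-mtp", "spec_n_max:6", "spec_p_min:0.75"]

# Option names that count as an explicit user choice
# (the second one is the legacy alias mentioned in the notes).
USER_KEYS = ("spec_type", "speculative_type")


def has_mtp_head(metadata: Mapping[str, object]) -> bool:
    """True when some `<arch>.nextn_predict_layers` key is present (hypothetical helper)."""
    return any(key.endswith(".nextn_predict_layers") for key in metadata)


def apply_mtp_defaults(
    options: Optional[Iterable[str]],
    metadata: Optional[Mapping[str, object]],
) -> List[str]:
    """Return a copy of `options`, with the MTP defaults appended when they apply."""
    result = list(options or [])  # tolerate a missing option list
    if not metadata or not has_mtp_head(metadata):
        return result  # no MTP head detected: leave options untouched
    for opt in result:
        name = opt.split(":", 1)[0].strip()
        if name in USER_KEYS:
            return result  # explicit user setting wins
    return result + MTP_DEFAULTS
```

For example, `apply_mtp_defaults([], {"arch.nextn_predict_layers": 1})` returns the three default options, while a list that already contains `spec_type:ngram-mod` comes back unchanged.
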
84 lines
4.8 KiB
Markdown
# llama.cpp Backend

The llama.cpp backend (`backend/cpp/llama-cpp/grpc-server.cpp`) is a gRPC adaptation of the upstream HTTP server (`llama.cpp/tools/server/server.cpp`). It uses the same underlying server infrastructure from `llama.cpp/tools/server/server-context.cpp`.

## Building and Testing

- Test llama.cpp backend compilation: `make backends/llama-cpp`
- The backend is built as part of the main build process
- Check `backend/cpp/llama-cpp/Makefile` for build configuration

## Architecture

- **grpc-server.cpp**: gRPC server implementation, adapts HTTP server patterns to gRPC
- Uses shared server infrastructure: `server-context.cpp`, `server-task.cpp`, `server-queue.cpp`, `server-common.cpp`
- The gRPC server mirrors the HTTP server's functionality but uses gRPC instead of HTTP

## Common Issues When Updating llama.cpp

When fixing compilation errors after upstream changes:

1. Check how `server.cpp` (HTTP server) handles the same change
2. Look for new public APIs or getter methods
3. Store copies of needed data instead of accessing private members
4. Update function calls to match new signatures
5. Test with `make backends/llama-cpp`

## Key Differences from HTTP Server

- gRPC uses `BackendServiceImpl` class with gRPC service methods
- HTTP server uses `server_routes` with HTTP handlers
- Both use the same `server_context` and task queue infrastructure
- gRPC methods: `LoadModel`, `Predict`, `PredictStream`, `Embedding`, `Rerank`, `TokenizeString`, `GetMetrics`, `Health`

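
As a rough picture of the split above, the following Python toy shows two thin front ends posting work to one shared context. It is not LocalAI or llama.cpp code: `Task`, `ServerContext`, `HttpRoutes`, and `GrpcService` are hypothetical stand-ins for the task queue, `server_context`, `server_routes`, and `BackendServiceImpl`, and the HTTP paths are illustrative only.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Task:
    kind: str      # e.g. "completion" or "embedding"
    payload: str


@dataclass
class ServerContext:
    """Stand-in for the shared context: it owns the single task queue."""
    queue: Deque[Task] = field(default_factory=deque)

    def post(self, task: Task) -> int:
        self.queue.append(task)
        return len(self.queue)  # queue length after posting, used as a toy task id


class HttpRoutes:
    """Toy HTTP front end: maps a request path to a task kind."""

    def __init__(self, ctx: ServerContext) -> None:
        self.ctx = ctx

    def handle(self, path: str, body: str) -> int:
        kind = {"/completion": "completion", "/embedding": "embedding"}[path]
        return self.ctx.post(Task(kind, body))


class GrpcService:
    """Toy gRPC front end: one method per RPC, same queue underneath."""

    def __init__(self, ctx: ServerContext) -> None:
        self.ctx = ctx

    def Predict(self, prompt: str) -> int:
        return self.ctx.post(Task("completion", prompt))

    def Embedding(self, text: str) -> int:
        return self.ctx.post(Task("embedding", text))
```

Because both front ends only translate requests into tasks, an upstream change to the shared infrastructure tends to surface in the same way in `server.cpp` and in `grpc-server.cpp`, which is why the HTTP server is the first place to look when the gRPC build breaks.
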
## Tool Call Parsing Maintenance

When working on JSON/XML tool call parsing functionality, always check llama.cpp for the reference implementation and updates:

### Checking for XML Parsing Changes

1. **Review XML Format Definitions**: Check `llama.cpp/common/chat-parser-xml-toolcall.h` for `xml_tool_call_format` struct changes
2. **Review Parsing Logic**: Check `llama.cpp/common/chat-parser-xml-toolcall.cpp` for parsing algorithm updates
3. **Review Format Presets**: Check `llama.cpp/common/chat-parser.cpp` for new XML format presets (search for `xml_tool_call_format form`)
4. **Review Model Lists**: Check `llama.cpp/common/chat.h` for `COMMON_CHAT_FORMAT_*` enum values that use XML parsing:
   - `COMMON_CHAT_FORMAT_GLM_4_5`
   - `COMMON_CHAT_FORMAT_MINIMAX_M2`
   - `COMMON_CHAT_FORMAT_KIMI_K2`
   - `COMMON_CHAT_FORMAT_QWEN3_CODER_XML`
   - `COMMON_CHAT_FORMAT_APRIEL_1_5`
   - `COMMON_CHAT_FORMAT_XIAOMI_MIMO`
   - Any new formats added

### Model Configuration Options

Always check `llama.cpp` for new model configuration options that should be supported in LocalAI:

1. **Check Server Context**: Review `llama.cpp/tools/server/server-context.cpp` for new parameters
2. **Check Chat Params**: Review `llama.cpp/common/chat.h` for `common_chat_params` struct changes
3. **Check Server Options**: Review `llama.cpp/tools/server/server.cpp` for command-line argument changes
4. **Examples of options to check**:
   - `ctx_shift` - Context shifting support
   - `parallel_tool_calls` - Parallel tool calling
   - `reasoning_format` - Reasoning format options
   - Any new flags or parameters

### Speculative Decoding Types

The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes; only the docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.

`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this: it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.

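
To make the two paragraphs above concrete, here is a small Python sketch of name-based type lookup and of the "no separate draft model" condition. It is an illustration, not a port of upstream's C++: `parse_spec_types` and `mtp_runs_on_target` are hypothetical names, the list of known values is copied from this guide, and raising `ValueError` on an unknown name is an assumption (upstream may handle unknown names differently).

```python
from typing import List, Optional

# Values listed in this guide; upstream's name map is the source of truth.
KNOWN_SPEC_TYPES = (
    "none", "draft-simple", "draft-eagle3", "draft-mtp",
    "ngram-simple", "ngram-map-k", "ngram-map-k4v", "ngram-mod", "ngram-cache",
)


def parse_spec_types(value: str) -> List[str]:
    """Split a comma-separated `spec_type` value and check each name."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in KNOWN_SPEC_TYPES]
    if unknown:
        # Assumption: reject unknown names instead of silently ignoring them.
        raise ValueError(f"unknown speculative type(s): {', '.join(unknown)}")
    return names


def mtp_runs_on_target(spec_types: List[str], draft_model: Optional[str]) -> bool:
    """True when `draft-mtp` is requested and no draft model is configured."""
    return "draft-mtp" in spec_types and not draft_model
```

With this sketch, `parse_spec_types("draft-mtp,ngram-mod")` yields both names in order, and `mtp_runs_on_target(["draft-mtp"], "")` is true. Adding a new upstream type would only mean extending the name table, which mirrors why LocalAI needs a docs entry but no code change.
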
### Implementation Guidelines

1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
2. **Test Coverage**: Add tests for new features matching llama.cpp's behavior
3. **Documentation**: Update relevant documentation when adding new formats or options
4. **Backward Compatibility**: Ensure changes don't break existing functionality

### Files to Monitor

- `llama.cpp/common/chat-parser-xml-toolcall.h` - Format definitions
- `llama.cpp/common/chat-parser-xml-toolcall.cpp` - Parsing logic
- `llama.cpp/common/chat-parser.cpp` - Format presets and model-specific handlers
- `llama.cpp/common/chat.h` - Format enums and parameter structures
- `llama.cpp/tools/server/server-context.cpp` - Server configuration options