llama.cpp Backend

The llama.cpp backend (backend/cpp/llama-cpp/grpc-server.cpp) is a gRPC adaptation of the upstream HTTP server (llama.cpp/tools/server/server.cpp). It uses the same underlying server infrastructure from llama.cpp/tools/server/server-context.cpp.

Building and Testing

  • Test llama.cpp backend compilation: make backends/llama-cpp
  • The backend is built as part of the main build process
  • Check backend/cpp/llama-cpp/Makefile for build configuration

Architecture

  • grpc-server.cpp: gRPC server implementation, adapts HTTP server patterns to gRPC
  • Uses shared server infrastructure: server-context.cpp, server-task.cpp, server-queue.cpp, server-common.cpp
  • The gRPC server mirrors the HTTP server's functionality, exposing the same capabilities over gRPC transport

Common Issues When Updating llama.cpp

When fixing compilation errors after upstream changes:

  1. Check how server.cpp (HTTP server) handles the same change
  2. Look for new public APIs or getter methods
  3. Store copies of needed data instead of accessing private members
  4. Update function calls to match new signatures
  5. Test with make backends/llama-cpp

Key Differences from HTTP Server

  • The gRPC server implements a BackendServiceImpl class exposing gRPC service methods
  • The HTTP server registers server_routes with HTTP handlers
  • Both use the same server_context and task queue infrastructure
  • gRPC methods: LoadModel, Predict, PredictStream, Embedding, Rerank, TokenizeString, GetMetrics, Health

Tool Call Parsing Maintenance

When working on JSON/XML tool call parsing functionality, always check llama.cpp for the reference implementation and recent upstream changes:

Checking for XML Parsing Changes

  1. Review XML Format Definitions: Check llama.cpp/common/chat-parser-xml-toolcall.h for xml_tool_call_format struct changes
  2. Review Parsing Logic: Check llama.cpp/common/chat-parser-xml-toolcall.cpp for parsing algorithm updates
  3. Review Format Presets: Check llama.cpp/common/chat-parser.cpp for new XML format presets (search for xml_tool_call_format usages)
  4. Review Model Lists: Check llama.cpp/common/chat.h for COMMON_CHAT_FORMAT_* enum values that use XML parsing:
    • COMMON_CHAT_FORMAT_GLM_4_5
    • COMMON_CHAT_FORMAT_MINIMAX_M2
    • COMMON_CHAT_FORMAT_KIMI_K2
    • COMMON_CHAT_FORMAT_QWEN3_CODER_XML
    • COMMON_CHAT_FORMAT_APRIEL_1_5
    • COMMON_CHAT_FORMAT_XIAOMI_MIMO
    • Any new formats added

Model Configuration Options

Always check llama.cpp for new model configuration options that should be supported in LocalAI:

  1. Check Server Context: Review llama.cpp/tools/server/server-context.cpp for new parameters
  2. Check Chat Params: Review llama.cpp/common/chat.h for common_chat_params struct changes
  3. Check Server Options: Review llama.cpp/tools/server/server.cpp for command-line argument changes
  4. Examples of options to check:
    • ctx_shift - Context shifting support
    • parallel_tool_calls - Parallel tool calling
    • reasoning_format - Reasoning format options
    • Any new flags or parameters

Implementation Guidelines

  1. Feature Parity: Always aim for feature parity with llama.cpp's implementation
  2. Test Coverage: Add tests for new features matching llama.cpp's behavior
  3. Documentation: Update relevant documentation when adding new formats or options
  4. Backward Compatibility: Ensure changes don't break existing functionality

Files to Monitor

  • llama.cpp/common/chat-parser-xml-toolcall.h - Format definitions
  • llama.cpp/common/chat-parser-xml-toolcall.cpp - Parsing logic
  • llama.cpp/common/chat-parser.cpp - Format presets and model-specific handlers
  • llama.cpp/common/chat.h - Format enums and parameter structures
  • llama.cpp/tools/server/server-context.cpp - Server configuration options