Implement llama.cpp GenAI Provider (#21690)

* Implement llama.cpp GenAI Provider

* Add docs

* Update links

* Fix broken mqtt links

* Fix more broken anchors
Nicolas Mowen
2026-01-18 06:34:30 -07:00
parent 3826d72c2a
commit 20360db2c9
5 changed files with 145 additions and 5 deletions


@@ -5,7 +5,7 @@ title: Configuring Generative AI
## Configuration
A Generative AI provider can be configured in the global config, which will make the Generative AI features available for use. There are currently 3 native providers available to integrate with Frigate. Other providers that support the OpenAI standard API can also be used. See the OpenAI section below.
A Generative AI provider can be configured in the global config, which will make the Generative AI features available for use. There are currently 4 native providers available to integrate with Frigate. Other providers that support the OpenAI standard API can also be used. See the OpenAI section below.
To use Generative AI, you must define a single provider at the global level of your Frigate configuration. If the provider you choose requires an API key, you may either directly paste it in your configuration, or store it in an environment variable prefixed with `FRIGATE_`.
@@ -77,8 +77,46 @@ genai:
provider: ollama
base_url: http://localhost:11434
model: qwen3-vl:4b
provider_options: # other Ollama client options can be defined
keep_alive: -1
options:
num_ctx: 8192 # make sure the context matches other services that are using ollama
```
## llama.cpp
[llama.cpp](https://github.com/ggml-org/llama.cpp) is a C/C++ LLM inference engine that provides a high-performance, OpenAI-compatible server. Running llama.cpp directly gives you access to all of its native options and parameters.
:::warning
Using llama.cpp on CPU is not recommended; high inference times make using Generative AI impractical.
:::
For best performance, it is highly recommended to host the llama.cpp server on a machine with a discrete graphics card or on an Apple silicon Mac.
### Supported Models
You must use a vision-capable model with Frigate. The llama.cpp server supports a variety of vision models in GGUF format.
### Configuration
```yaml
genai:
provider: llamacpp
base_url: http://localhost:8080
model: your-model-name
provider_options:
temperature: 0.7
repeat_penalty: 1.05
top_p: 0.8
top_k: 40
min_p: 0.05
seed: -1
```
All llama.cpp native options can be passed through `provider_options`, including `temperature`, `top_k`, `top_p`, `min_p`, `repeat_penalty`, `repeat_last_n`, `seed`, `grammar`, and more. See the [llama.cpp server documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md) for a complete list of available parameters.
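For reference, a minimal sketch (assuming a llama.cpp server on `http://localhost:8080`) of how these options travel with the request, mirroring the provider implementation in this commit:

```python
import requests

# Options a user might set under `provider_options` in the Frigate config.
provider_options = {"temperature": 0.7, "top_k": 40, "min_p": 0.05}

payload = {
    "messages": [{"role": "user", "content": "Describe this scene."}],
    # llama.cpp accepts its native sampling options alongside the
    # standard OpenAI fields, so they are merged into the payload as-is.
    **provider_options,
}

resp = requests.post(
    "http://localhost:8080/v1/chat/completions", json=payload, timeout=30
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```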
## Google Gemini
Google Gemini has a [free tier](https://ai.google.dev/pricing) for the API; however, the limits may not be sufficient for standard Frigate usage. Choose a plan appropriate for your installation.
@@ -185,4 +223,4 @@ genai:
base_url: https://instance.cognitiveservices.azure.com/openai/responses?api-version=2025-04-01-preview
model: gpt-5-mini
api_key: "{FRIGATE_OPENAI_API_KEY}"
```


@@ -11,7 +11,7 @@ By default, descriptions will be generated for all tracked objects and all zones
Optionally, you can generate the description using a snapshot (if enabled) by setting `use_snapshot` to `True`. By default, this is set to `False`, which sends the uncompressed images from the `detect` stream collected over the object's lifetime to the model. Once the object lifecycle ends, only a single compressed and cropped thumbnail is saved with the tracked object. Using a snapshot might be useful when you want to _regenerate_ a tracked object's description as it will provide the AI with a higher-quality image (typically downscaled by the AI itself) than the cropped/compressed thumbnail. Using a snapshot otherwise has a trade-off in that only a single image is sent to your provider, which will limit the model's ability to determine object movement or direction.
Generative AI object descriptions can also be toggled dynamically for a camera via MQTT with the topic `frigate/<camera_name>/object_descriptions/set`. See the [MQTT documentation](/integrations/mqtt/#frigatecamera_nameobjectdescriptionsset).
Generative AI object descriptions can also be toggled dynamically for a camera via MQTT with the topic `frigate/<camera_name>/object_descriptions/set`. See the [MQTT documentation](/integrations/mqtt#frigatecamera_nameobject_descriptionsset).
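As an illustration, here is a minimal sketch of such a toggle using the `paho-mqtt` client; the broker address and the camera name `front_door` are assumptions for the example:

```python
import paho.mqtt.publish as publish

# Disable object descriptions for a hypothetical camera named "front_door".
# Frigate's MQTT set topics take "ON"/"OFF" payloads.
publish.single(
    "frigate/front_door/object_descriptions/set",
    payload="OFF",
    hostname="localhost",  # assumed broker address
)
```

Publishing `ON` to the same topic re-enables description generation.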
## Usage and Best Practices
@@ -75,4 +75,4 @@ Many providers also have a public facing chat interface for their models. Downlo
- OpenAI - [ChatGPT](https://chatgpt.com)
- Gemini - [Google AI Studio](https://aistudio.google.com)
- Ollama - [Open WebUI](https://docs.openwebui.com/)


@@ -7,7 +7,7 @@ Generative AI can be used to automatically generate structured summaries of revi
Summaries are requested automatically from your AI provider for alert review items once the activity has ended; they can optionally be enabled for detections as well.
Generative AI review summaries can also be toggled dynamically for a [camera via MQTT](/integrations/mqtt/#frigatecamera_namereviewdescriptionsset).
Generative AI review summaries can also be toggled dynamically for a [camera via MQTT](/integrations/mqtt#frigatecamera_namereview_descriptionsset).
## Review Summary Usage and Best Practices


@@ -14,6 +14,7 @@ class GenAIProviderEnum(str, Enum):
azure_openai = "azure_openai"
gemini = "gemini"
ollama = "ollama"
llamacpp = "llamacpp"
class GenAIConfig(FrigateBaseModel):

frigate/genai/llama_cpp.py (new file)

@@ -0,0 +1,101 @@
"""llama.cpp Provider for Frigate AI."""
import base64
import logging
from typing import Any, Optional
import requests
from frigate.config import GenAIProviderEnum
from frigate.genai import GenAIClient, register_genai_provider
logger = logging.getLogger(__name__)
@register_genai_provider(GenAIProviderEnum.llamacpp)
class LlamaCppClient(GenAIClient):
"""Generative AI client for Frigate using llama.cpp server."""
    # Sampling defaults tuned for local models; user-supplied
    # provider_options override these (see _init_provider).
    LOCAL_OPTIMIZED_OPTIONS = {
"temperature": 0.7,
"repeat_penalty": 1.05,
"top_p": 0.8,
}
    provider: Optional[str]  # base_url of the llama.cpp server, set by _init_provider
provider_options: dict[str, Any]
def _init_provider(self):
"""Initialize the client."""
self.provider_options = {
**self.LOCAL_OPTIMIZED_OPTIONS,
**self.genai_config.provider_options,
}
return (
self.genai_config.base_url.rstrip("/")
if self.genai_config.base_url
else None
)
def _send(self, prompt: str, images: list[bytes]) -> Optional[str]:
"""Submit a request to llama.cpp server."""
if self.provider is None:
logger.warning(
"llama.cpp provider has not been initialized, a description will not be generated. Check your llama.cpp configuration."
)
return None
try:
content = []
for image in images:
encoded_image = base64.b64encode(image).decode("utf-8")
content.append(
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{encoded_image}",
},
}
)
content.append(
{
"type": "text",
"text": prompt,
}
)
# Build request payload with llama.cpp native options
payload = {
"messages": [
{
"role": "user",
"content": content,
},
],
**self.provider_options,
}
            # llama.cpp exposes an OpenAI-compatible chat completions endpoint
            response = requests.post(
f"{self.provider}/v1/chat/completions",
json=payload,
timeout=self.timeout,
)
response.raise_for_status()
result = response.json()
if (
result is not None
and "choices" in result
and len(result["choices"]) > 0
):
choice = result["choices"][0]
if "message" in choice and "content" in choice["message"]:
return choice["message"]["content"].strip()
return None
except Exception as e:
logger.warning("llama.cpp returned an error: %s", str(e))
return None
def get_context_size(self) -> int:
"""Get the context window size for llama.cpp."""
return self.genai_config.provider_options.get("context_size", 4096)
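The option merge in `_init_provider` and the context-size fallback are plain dictionary semantics; a small standalone sketch of the behavior:

```python
# Later entries win when dicts are unpacked in order, so user-supplied
# provider_options override the LOCAL_OPTIMIZED_OPTIONS defaults.
defaults = {"temperature": 0.7, "repeat_penalty": 1.05, "top_p": 0.8}
user_options = {"temperature": 0.2, "top_k": 40}

merged = {**defaults, **user_options}
assert merged["temperature"] == 0.2  # user value wins
assert merged["repeat_penalty"] == 1.05  # untouched default survives

# get_context_size() falls back to 4096 when `context_size` is unset.
assert user_options.get("context_size", 4096) == 4096
```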