feat(gallery): add Gemma 4 QAT family + MTP speculative-decoding pairs (#10215)

Add the remaining official Google Gemma 4 QAT Q4_0 GGUFs (E2B, E4B, 26B-A4B, 31B) next to the existing 12B entry, each shipping its multimodal mmproj.

Also add three MTP (Multi-Token Prediction) speculative-decoding bundles that pair each QAT target with a QAT-matched assistant/drafter head:

- 12B <- Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF
- 26B-A4B <- boxwrench/gemma-4-qat-mtp-assistant-heads
- 31B <- boxwrench/gemma-4-qat-mtp-assistant-heads

The assistant heads use the gemma4_assistant architecture and are not standalone chat models, so each entry bundles the target + draft and sets draft_model together with the draft-mtp spec options (spec_type:draft-mtp / spec_n_max:6 / spec_p_min:0.75), matching MTPSpecOptions() in core/config/mtp.go. QAT-matched heads raise draft acceptance substantially over generic non-QAT heads.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
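
The bundling rule described above (target + mmproj + draft shipped together, with the three draft-mtp options set) can be spot-checked with a small script. This is only a sketch under stated assumptions: `check_mtp_entry`, `parse_options`, and `REQUIRED_MTP_OPTIONS` are hypothetical helpers written for illustration, not part of LocalAI, and the entry is a hand-copied, trimmed version of the 12B MTP entry from this commit (sha256 and uri fields omitted).

```python
# Hypothetical consistency check for an MTP gallery entry (not LocalAI code).
# The expected option values are taken from the commit message above.
REQUIRED_MTP_OPTIONS = {
    "spec_type": "draft-mtp",
    "spec_n_max": "6",
    "spec_p_min": "0.75",
}


def parse_options(options):
    """Turn 'key:value' strings into a dict, splitting on the first colon only."""
    return dict(opt.split(":", 1) for opt in options)


def check_mtp_entry(entry):
    """Return a list of human-readable problems found in one gallery entry."""
    problems = []
    overrides = entry["overrides"]
    shipped = {f["filename"] for f in entry["files"]}

    # Every path the overrides reference should also be listed under files.
    referenced = {
        "model": overrides["parameters"]["model"],
        "mmproj": overrides.get("mmproj"),
        "draft_model": overrides.get("draft_model"),
    }
    for key, path in referenced.items():
        if path is None:
            problems.append(f"{key} is not set")
        elif path not in shipped:
            problems.append(f"{key} points at a file missing from files: {path}")

    # The draft-mtp options should be present with the expected values.
    opts = parse_options(overrides.get("options", []))
    for key, want in REQUIRED_MTP_OPTIONS.items():
        if opts.get(key) != want:
            problems.append(f"option {key} is {opts.get(key)!r}, expected {want!r}")
    return problems


# Trimmed copy of the gemma-4-12b-it-qat-mtp entry from this commit.
TARGET = "llama-cpp/models/gemma-4-12B-it-qat-q4_0-gguf/gemma-4-12b-it-qat-q4_0.gguf"
MMPROJ = "llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf"
DRAFT = (
    "llama-cpp/models/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF/"
    "gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf"
)

entry = {
    "name": "gemma-4-12b-it-qat-mtp",
    "overrides": {
        "mmproj": MMPROJ,
        "draft_model": DRAFT,
        "options": [
            "use_jinja:true",
            "spec_type:draft-mtp",
            "spec_n_max:6",
            "spec_p_min:0.75",
        ],
        "parameters": {"model": TARGET},
    },
    "files": [{"filename": TARGET}, {"filename": MMPROJ}, {"filename": DRAFT}],
}

if __name__ == "__main__":
    for problem in check_mtp_entry(entry):
        print(problem)
```

An empty result means the entry is self-consistent in the sense checked here. The check says nothing about whether the hashes match or whether the backend actually loads the draft head.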
2026-07-30 09:57:57 -04:00 · 2026-06-08 10:26:42 +02:00
parent 92dea961c2
commit 618e90cd13
1 changed file with 329 additions and 0 deletions
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -106,6 +106,335 @@
    - filename: llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf
      sha256: e70b0e5cd80323d5d588b4ed06780356b7b1ba03995a4b8164c6ae9db0ff5989
      uri: https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+- name: "gemma-4-e2b-it-qat-q4_0"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf
+  description: |
+    Gemma 4 E2B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E2B is a MatFormer "effective 2B" elastic variant: it carries a larger backbone but runs at an effective 2B-parameter footprint, making it well suited to lightweight and on-device deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector.
+
+    License: Apache 2.0 | Authors: Google DeepMind
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qat
+    - multimodal
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-E2B-it-qat-q4_0-gguf/gemma-4-E2B-it-mmproj.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-E2B-it-qat-q4_0-gguf/gemma-4-E2B_q4_0-it.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-E2B-it-qat-q4_0-gguf/gemma-4-E2B_q4_0-it.gguf
+      sha256: 3646b4c147cd235a44d91df1546d3b7d8e29b547dbe4e1f80856419aa455e6fd
+      uri: https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B_q4_0-it.gguf
+    - filename: llama-cpp/mmproj/gemma-4-E2B-it-qat-q4_0-gguf/gemma-4-E2B-it-mmproj.gguf
+      sha256: 58c187648007cab392bd5678b87e862c3e8794017deb945feea2cf256195e96a
+      uri: https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B-it-mmproj.gguf
+- name: "gemma-4-e4b-it-qat-q4_0"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-gguf
+  description: |
+    Gemma 4 E4B is a multimodal (text + image) instruction-tuned model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. E4B is a MatFormer "effective 4B" elastic variant, balancing quality and footprint for on-device and edge deployments. This is the official Google Q4_0 GGUF, shipped with its multimodal projector.
+
+    License: Apache 2.0 | Authors: Google DeepMind
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qat
+    - multimodal
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-E4B-it-qat-q4_0-gguf/gemma-4-E4B-it-mmproj.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-E4B-it-qat-q4_0-gguf/gemma-4-E4B_q4_0-it.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-E4B-it-qat-q4_0-gguf/gemma-4-E4B_q4_0-it.gguf
+      sha256: e8b6a059ba86947a44ace84d6e5679795bc41862c25c30513142588f0e9dba1d
+      uri: https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-gguf/resolve/main/gemma-4-E4B_q4_0-it.gguf
+    - filename: llama-cpp/mmproj/gemma-4-E4B-it-qat-q4_0-gguf/gemma-4-E4B-it-mmproj.gguf
+      sha256: c6398448d84a4836fdedf58f9775979e69ae0cc4dfdf4d697b5597693a555b12
+      uri: https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-gguf/resolve/main/gemma-4-E4B-it-mmproj.gguf
+- name: "gemma-4-26b-a4b-it-qat-q4_0"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf
+  description: |
+    Gemma 4 26B-A4B is a multimodal (text + image) instruction-tuned Mixture-of-Experts model from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality at a fraction of the memory. With 26B total parameters and ~4B active per token, it delivers large-model quality at a much lower inference cost. This is the official Google Q4_0 GGUF, shipped with its multimodal projector.
+
+    License: Apache 2.0 | Authors: Google DeepMind
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qat
+    - multimodal
+    - moe
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B-it-mmproj.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B_q4_0-it.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B_q4_0-it.gguf
+      sha256: 4c856523d61d77922dbc0b26753a6bf6208e5d69d80db0c04dcd776832d054c5
+      uri: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf/resolve/main/gemma-4-26B_q4_0-it.gguf
+    - filename: llama-cpp/mmproj/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B-it-mmproj.gguf
+      sha256: d8e2de16e17515d9061b23c9a002715f996f9e0c87b93a9354264611bfab9239
+      uri: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf/resolve/main/gemma-4-26B-it-mmproj.gguf
+- name: "gemma-4-31b-it-qat-q4_0"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf
+  description: |
+    Gemma 4 31B is the largest dense multimodal (text + image) instruction-tuned model in the Gemma 4 family from Google DeepMind, optimized with Quantization-Aware Training (QAT) to preserve bfloat16-level quality while dramatically reducing the memory required to load the model. This is the official Google Q4_0 GGUF, shipped with its multimodal projector.
+
+    License: Apache 2.0 | Authors: Google DeepMind
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qat
+    - multimodal
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B-it-mmproj.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B_q4_0-it.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B_q4_0-it.gguf
+      sha256: 0374ce7b0124db9ba96fc649e835c531223ee224a497ce88a374baaea10932ec
+      uri: https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf/resolve/main/gemma-4-31B_q4_0-it.gguf
+    - filename: llama-cpp/mmproj/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B-it-mmproj.gguf
+      sha256: 8e239c9c592541c9f537fff75677ea30d8af1e14ba63d27cf245423b7d0a688b
+      uri: https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf/resolve/main/gemma-4-31B-it-mmproj.gguf
+- name: "gemma-4-12b-it-qat-mtp"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf
+    - https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF
+  description: |
+    Gemma 4 12B IT QAT (Google DeepMind) paired with the official QAT assistant/drafter head for Multi-Token Prediction (MTP) speculative decoding. The Q4_0 target carries the full multimodal (text + image) model, while the Q8_0 assistant GGUF (from Janvitos, converted from Google's `gemma-4-12B-it-qat-q4_0-unquantized-assistant` checkpoint) acts as the draft model. With llama.cpp's `draft-mtp` speculative path enabled, this combination accelerates generation while keeping the target model's quality. The assistant head is not a standalone chat model: it only runs paired with the target, which is why both are bundled here.
+
+    License: Apache 2.0 | Authors: Google DeepMind (target/assistant checkpoints), Janvitos (GGUF conversion)
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qat
+    - multimodal
+    - mtp
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+    draft_model: llama-cpp/models/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF/gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf
+    options:
+      - use_jinja:true
+      - spec_type:draft-mtp
+      - spec_n_max:6
+      - spec_p_min:0.75
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-12B-it-qat-q4_0-gguf/gemma-4-12b-it-qat-q4_0.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-12B-it-qat-q4_0-gguf/gemma-4-12b-it-qat-q4_0.gguf
+      sha256: faff1a63667fac17ac5e777f47114688fcefea96e220e211aaa8d62c2c4561f1
+      uri: https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/gemma-4-12b-it-qat-q4_0.gguf
+    - filename: llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+      sha256: e70b0e5cd80323d5d588b4ed06780356b7b1ba03995a4b8164c6ae9db0ff5989
+      uri: https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+    - filename: llama-cpp/models/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF/gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf
+      sha256: 13331068b6af643c3dc75e619373b674c1f75a1958e7c82e2020d96a17c63809
+      uri: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF/resolve/main/gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf
+- name: "gemma-4-26b-a4b-it-qat-mtp"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf
+    - https://huggingface.co/boxwrench/gemma-4-qat-mtp-assistant-heads
+  description: |
+    Gemma 4 26B-A4B IT QAT (Google DeepMind), a multimodal Mixture-of-Experts model (26B total, ~4B active per token), paired with the QAT-matched MTP assistant/drafter head for Multi-Token Prediction speculative decoding. The Q4_0 target carries the full multimodal (text + image) model, while the Q8_0 assistant GGUF (from boxwrench, converted from Google's `gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant` checkpoint) acts as the draft model. Using a QAT-matched head instead of a generic non-QAT head raised draft acceptance from ~57% to ~92% on this model. The assistant head is not a standalone chat model: it only runs paired with the target, which is why both are bundled here.
+
+    > [!Note]
+    > The assistant head uses the `gemma4_assistant` architecture. It loads on the Atomic TurboQuant llama.cpp fork and on stock llama.cpp once ggml-org/llama.cpp#23398 ("llama: add Gemma4 MTP") merges. Until the upstream `n_tokens` reshape fix lands, run with a single parallel slot.
+
+    License: Apache 2.0 | Authors: Google DeepMind (target/assistant checkpoints), boxwrench (GGUF conversion)
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qat
+    - multimodal
+    - moe
+    - mtp
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B-it-mmproj.gguf
+    draft_model: llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf
+    options:
+      - use_jinja:true
+      - spec_type:draft-mtp
+      - spec_n_max:6
+      - spec_p_min:0.75
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B_q4_0-it.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B_q4_0-it.gguf
+      sha256: 4c856523d61d77922dbc0b26753a6bf6208e5d69d80db0c04dcd776832d054c5
+      uri: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf/resolve/main/gemma-4-26B_q4_0-it.gguf
+    - filename: llama-cpp/mmproj/gemma-4-26B-A4B-it-qat-q4_0-gguf/gemma-4-26B-it-mmproj.gguf
+      sha256: d8e2de16e17515d9061b23c9a002715f996f9e0c87b93a9354264611bfab9239
+      uri: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gguf/resolve/main/gemma-4-26B-it-mmproj.gguf
+    - filename: llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf
+      sha256: 86f156403d9148aeffa765411f1373d1a2f9c840d62f5e088701153a35ecff73
+      uri: https://huggingface.co/boxwrench/gemma-4-qat-mtp-assistant-heads/resolve/main/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf
+- name: "gemma-4-31b-it-qat-mtp"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf
+    - https://huggingface.co/boxwrench/gemma-4-qat-mtp-assistant-heads
+  description: |
+    Gemma 4 31B IT QAT (Google DeepMind), the largest dense multimodal model in the family, paired with the QAT-matched MTP assistant/drafter head for Multi-Token Prediction speculative decoding. The Q4_0 target carries the full multimodal (text + image) model, while the Q8_0 assistant GGUF (from boxwrench, converted from Google's `gemma-4-31B-it-qat-q4_0-unquantized-assistant` checkpoint) acts as the draft model. Using a QAT-matched head instead of a generic non-QAT head substantially raises draft acceptance and end-to-end throughput. The assistant head is not a standalone chat model: it only runs paired with the target, which is why both are bundled here.
+
+    > [!Note]
+    > The assistant head uses the `gemma4_assistant` architecture. It loads on the Atomic TurboQuant llama.cpp fork and on stock llama.cpp once ggml-org/llama.cpp#23398 ("llama: add Gemma4 MTP") merges. Until the upstream `n_tokens` reshape fix lands, run with a single parallel slot.
+
+    License: Apache 2.0 | Authors: Google DeepMind (target/assistant checkpoints), boxwrench (GGUF conversion)
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qat
+    - multimodal
+    - mtp
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B-it-mmproj.gguf
+    draft_model: llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf
+    options:
+      - use_jinja:true
+      - spec_type:draft-mtp
+      - spec_n_max:6
+      - spec_p_min:0.75
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B_q4_0-it.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B_q4_0-it.gguf
+      sha256: 0374ce7b0124db9ba96fc649e835c531223ee224a497ce88a374baaea10932ec
+      uri: https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf/resolve/main/gemma-4-31B_q4_0-it.gguf
+    - filename: llama-cpp/mmproj/gemma-4-31B-it-qat-q4_0-gguf/gemma-4-31B-it-mmproj.gguf
+      sha256: 8e239c9c592541c9f537fff75677ea30d8af1e14ba63d27cf245423b7d0a688b
+      uri: https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf/resolve/main/gemma-4-31B-it-mmproj.gguf
+    - filename: llama-cpp/models/gemma-4-qat-mtp-assistant-heads/gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf
+      sha256: 7a7cdd65a93536f3bf324e97ddf60cc8d482510eaa0837873aef0fd7e0b493a5
+      uri: https://huggingface.co/boxwrench/gemma-4-qat-mtp-assistant-heads/resolve/main/gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf
 - name: "step-3.7-flash"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls: