* mlx: add laguna model support
* convert: support fp8 safetensors import
Decode HF F8_E4M3 safetensors with block scale companions into GGUF-supported tensor types, and record which output tensors came from FP8 source weights.
Use that source-precision metadata during create quantization: default FP8-sourced GGUFs to Q8_0, keep non-FP8 tensors at their original precision for Q8_0, and promote non-FP8 quantizable tensors to Q8_0 for Q4_K requests.
* ggml: add laguna model support
* server: preserve generate logprobs with builtin parsers
Generate requests were dropping logprob-only chunks whenever a builtin parser buffered visible content. Chat already handled this case, but generate only forwarded chunks with visible response, thinking, or tool-call output.
Keep generate chunks that carry logprobs even when the builtin parser has not flushed visible content yet, and add a regression test that exercises the behavior with a generic thinking parser.
* review comments - perf improvements
* ggml: implement nemotron 3 nano omni
* add poolside integration
* update poolside doc
* adapt to new cache setup
* fix test
* fix test
---------
Co-authored-by: Eva Ho <hoyyeva@gmail.com>
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
- Create backend, including enumerating tensors and memory allocation
- Loading tensor data
This allows more flexibility in managing model loading.
The quantization PR didn't block all unsupported file types,
which this PR fixes. It also updates the API docs to reflect
the now reduced set of supported types.
* Move quantization logic to GGML via new backend
This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.
* Remove "add model quantizations"
This is no longer needed now that quantization is implemented in Go+GGML code directly.