mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-03 13:56:46 -04:00
The split_mode: tensor description claimed tensor parallelism requires KV-cache quantization to be disabled. ggml-org/llama.cpp#23792 lifts that restriction by extending the meta backend to preserve shape information through KV-cache flatten/reshape, so cache_type_k/cache_type_v quantization can be combined with -sm tensor on builds that include it. Documentation only: no backend code, grpc-server.cpp comment, or llama.cpp pin changes. Assisted-by: Claude Code:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>