* docs: Add documentation about GPU auto-fit mode limitations (closes #8562)

  - Document the default gpu_layers behavior (9999999) that disables auto-fit
  - Explain the trade-off between auto-fit and VRAM threshold unloading
  - Add recommendations for users who want to enable gpu_layers: -1
  - Note known issues with tensor_buft_override buffer errors
  - Link to issue #8562 for future improvements

  Signed-off-by: team-coding-agent-1 <team-coding-agent-1@localai.dev>

* Apply suggestion from @mudler

  Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: team-coding-agent-1 <team-coding-agent-1@localai.dev>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: team-coding-agent-1 <team-coding-agent-1@localai.dev>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
@@ -610,3 +610,20 @@ feature_flags:
- See [Prompt Templates]({{%relref "advanced/advanced-usage#prompt-templates" %}}) for template examples
- See [CLI Reference]({{%relref "reference/cli-reference" %}}) for command-line options
### GPU Auto-Fit Mode
**Note**: By default, LocalAI sets `gpu_layers` to a very large value (9999999), which effectively disables llama-cpp's auto-fit functionality. This is intentional to work with LocalAI's VRAM-based model unloading mechanism.
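
As an illustration, the default corresponds to a model YAML along these lines (the model name and file below are placeholders, not part of the original docs; `gpu_layers` is the only setting being discussed):

```yaml
# Hypothetical model configuration (name and file are placeholders).
name: my-model
gpu_layers: 9999999   # LocalAI's default: offload as many layers as possible,
                      # which bypasses llama-cpp's auto-fit logic
parameters:
  model: my-model.gguf
```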
To enable llama-cpp's auto-fit mode, set `gpu_layers: -1` in your model configuration. However, be aware of the following:
1. **Trade-off**: Enabling auto-fit conflicts with LocalAI's built-in VRAM threshold-based unloading. Auto-fit attempts to fit all tensors into GPU memory automatically, while LocalAI's unloading mechanism removes models when VRAM usage exceeds thresholds.
2. **Known Issues**: Setting `gpu_layers: -1` may trigger `tensor_buft_override` buffer errors in some configurations, particularly when the model exceeds available GPU memory.
3. **Recommendation**:
   - Use the default settings for most use cases (LocalAI manages VRAM automatically)
   - Only enable `gpu_layers: -1` if you understand the implications and have tested on your specific hardware
   - Monitor VRAM usage carefully when using auto-fit mode
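
Putting the above together, a sketch of a model YAML that opts into auto-fit might look like this (the model name and file are hypothetical placeholders):

```yaml
# Hypothetical model configuration opting into llama-cpp auto-fit.
name: my-model
gpu_layers: -1        # let llama-cpp decide how many layers fit in VRAM
parameters:
  model: my-model.gguf
```

While testing such a configuration, VRAM usage can be watched with a tool such as `nvidia-smi` on NVIDIA hardware.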
This is a known limitation being tracked in issue [#8562](https://github.com/mudler/LocalAI/issues/8562). A future implementation may provide a runtime toggle or custom logic to reconcile auto-fit with threshold-based unloading.