feat: disable force eviction (#7725)

* feat: allow to set forcing backends eviction while requests are in flight

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: try to make the request sit and retry if eviction couldn't be done

Otherwise calls that in order to pass would need to shutdown other backends would just fail.

In this way instead we make the request sit and retry eviction until it succeeds. The thresholds can be configured by the user.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* add tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* expose settings to CLI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-13 09:44:26 -04:00 · 2025-12-25 14:26:18 +01:00
parent bb459e671f
commit c844b7ac58
18 changed files with 739 additions and 41 deletions
--- a/docs/content/advanced/vram-management.md
+++ b/docs/content/advanced/vram-management.md
@@ -52,6 +52,49 @@ Setting the limit to `1` is equivalent to single active backend mode (see below)
 3. The LRU model(s) are automatically unloaded to make room for the new model
 4. Concurrent requests for loading different models are handled safely - the system accounts for models currently being loaded when calculating evictions

+### Eviction Behavior with Active Requests
+
+By default, LocalAI will **skip evicting models that have active API calls** to prevent interrupting ongoing requests. This means:
+
+- If all models are busy (have active requests), eviction will be skipped and the system will wait for models to become idle
+- The loading request will retry eviction with configurable retry settings
+- This ensures data integrity and prevents request failures
+
+You can configure this behavior via WebUI or using the following settings:
+
+#### Force Eviction When Busy
+
+To allow evicting models even when they have active API calls (not recommended for production):
+
+```bash
+# Via CLI
+./local-ai --force-eviction-when-busy
+
+# Via environment variable
+LOCALAI_FORCE_EVICTION_WHEN_BUSY=true ./local-ai
+```
+
+> **Warning:** Enabling force eviction can interrupt active requests and cause errors. Only use this if you understand the implications.
+
+#### LRU Eviction Retry Settings
+
+When models are busy and cannot be evicted, LocalAI will retry eviction with configurable settings:
+
+```bash
+# Configure maximum retries (default: 30)
+./local-ai --lru-eviction-max-retries=50
+
+# Configure retry interval (default: 1s)
+./local-ai --lru-eviction-retry-interval=2s
+
+# Using environment variables
+LOCALAI_LRU_EVICTION_MAX_RETRIES=50 \
+LOCALAI_LRU_EVICTION_RETRY_INTERVAL=2s \
+./local-ai
+```
+
+These settings control how long the system will wait for busy models to become idle before giving up. The retry mechanism allows busy models to complete their requests before being evicted, preventing request failures.
+
 ### Example

 ```bash
@@ -207,6 +250,33 @@ This configuration:
 - Automatically unloads any model that hasn't been used for 15 minutes
 - Provides both hard limits and time-based cleanup

+### Example with Retry Settings
+
+You can also configure retry behavior when models are busy:
+
+```bash
+# Allow up to 2 active backends with custom retry settings
+LOCALAI_MAX_ACTIVE_BACKENDS=2 \
+LOCALAI_LRU_EVICTION_MAX_RETRIES=50 \
+LOCALAI_LRU_EVICTION_RETRY_INTERVAL=2s \
+./local-ai
+```
+
+Or using command line flags:
+
+```bash
+./local-ai \
+  --max-active-backends=2 \
+  --lru-eviction-max-retries=50 \
+  --lru-eviction-retry-interval=2s
+```
+
+This configuration:
+- Limits to 2 active backends
+- Will retry eviction up to 50 times if models are busy
+- Waits 2 seconds between retry attempts
+- Ensures busy models have time to complete their requests before eviction
+
 ## Limitations and Considerations

 ### VRAM Usage Estimation
--- a/docs/content/features/runtime-settings.md
+++ b/docs/content/features/runtime-settings.md
@@ -29,9 +29,23 @@ Changes to watchdog settings are applied immediately by restarting the watchdog

 - **Max Active Backends**: Maximum number of active backends (loaded models). When exceeded, the least recently used model is automatically evicted. Set to `0` for unlimited, `1` for single-backend mode
 - **Parallel Backend Requests**: Enable backends to handle multiple requests in parallel if supported
+- **Force Eviction When Busy**: Allow evicting models even when they have active API calls (default: disabled for safety). **Warning:** Enabling this can interrupt active requests
+- **LRU Eviction Max Retries**: Maximum number of retries when waiting for busy models to become idle before eviction (default: 30)
+- **LRU Eviction Retry Interval**: Interval between retries when waiting for busy models (default: `1s`)

 > **Note:** The "Single Backend" setting is deprecated. Use "Max Active Backends" set to `1` for single-backend behavior.

+#### LRU Eviction Behavior
+
+By default, LocalAI will skip evicting models that have active API calls to prevent interrupting ongoing requests. When all models are busy and eviction is needed:
+
+1. The system will wait for models to become idle
+2. It will retry eviction up to the configured maximum number of retries
+3. The retry interval determines how long to wait between attempts
+4. If all retries are exhausted, the system will proceed (which may cause out-of-memory errors if resources are truly exhausted)
+
+You can configure these settings via the web UI or through environment variables. See [VRAM Management]({{%relref "advanced/vram-management" %}}) for more details.
+
 ### Performance Settings

 - **Threads**: Number of threads used for parallel computation (recommended: number of physical cores)
@@ -94,6 +108,9 @@ The `runtime_settings.json` file follows this structure:
  "watchdog_busy_timeout": "5m",
  "max_active_backends": 0,
  "parallel_backend_requests": true,
+  "force_eviction_when_busy": false,
+  "lru_eviction_max_retries": 30,
+  "lru_eviction_retry_interval": "1s",
  "threads": 8,
  "context_size": 2048,
  "f16": false,
--- a/docs/content/features/text-to-audio.md
+++ b/docs/content/features/text-to-audio.md
@@ -128,7 +128,7 @@ Future versions of LocalAI will expose additional control over audio generation

 #### Setup

-Install the `vibevoice` model in the Model gallery.
+Install the `vibevoice` model in the Model gallery or run `local-ai run models install vibevoice`.

 #### Usage

--- a/docs/content/reference/cli-reference.md
+++ b/docs/content/reference/cli-reference.md
@@ -46,6 +46,9 @@ Complete reference for all LocalAI command-line interface (CLI) parameters and e
 | `--watchdog-idle-timeout` | `15m` | Threshold beyond which an idle backend should be stopped | `$LOCALAI_WATCHDOG_IDLE_TIMEOUT`, `$WATCHDOG_IDLE_TIMEOUT` |
 | `--enable-watchdog-busy` | `false` | Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout | `$LOCALAI_WATCHDOG_BUSY`, `$WATCHDOG_BUSY` |
 | `--watchdog-busy-timeout` | `5m` | Threshold beyond which a busy backend should be stopped | `$LOCALAI_WATCHDOG_BUSY_TIMEOUT`, `$WATCHDOG_BUSY_TIMEOUT` |
+| `--force-eviction-when-busy` | `false` | Force eviction even when models have active API calls (default: false for safety). **Warning:** Enabling this can interrupt active requests | `$LOCALAI_FORCE_EVICTION_WHEN_BUSY`, `$FORCE_EVICTION_WHEN_BUSY` |
+| `--lru-eviction-max-retries` | `30` | Maximum number of retries when waiting for busy models to become idle before eviction | `$LOCALAI_LRU_EVICTION_MAX_RETRIES`, `$LRU_EVICTION_MAX_RETRIES` |
+| `--lru-eviction-retry-interval` | `1s` | Interval between retries when waiting for busy models to become idle (e.g., `1s`, `2s`) | `$LOCALAI_LRU_EVICTION_RETRY_INTERVAL`, `$LRU_EVICTION_RETRY_INTERVAL` |

 For more information on VRAM management, see [VRAM and Memory Management]({{%relref "advanced/vram-management" %}}).
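
The docs in this change say the two retry settings together control how long a load request waits for busy models to become idle. A rough sketch of that arithmetic follows, assuming the wait is simply the number of retries multiplied by the interval and ignoring the time each eviction attempt itself takes. The helper name `max_eviction_wait_seconds` is hypothetical and not part of LocalAI, and the interval is passed as whole seconds rather than a duration string such as `2s`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of LocalAI): approximate upper bound, in seconds,
# on how long a load request waits for busy models before retries run out.
# Assumes one full retry interval is spent per retry.
max_eviction_wait_seconds() {
  local max_retries="$1"       # value of --lru-eviction-max-retries
  local interval_seconds="$2"  # value of --lru-eviction-retry-interval, in whole seconds
  echo $(( max_retries * interval_seconds ))
}

max_eviction_wait_seconds 30 1   # documented defaults: 30 retries, 1s interval
max_eviction_wait_seconds 50 2   # values used in the documentation examples
```

Under this assumption the defaults allow roughly 30 seconds of waiting, and the `50` retries at `2s` used in the examples allow roughly 100 seconds.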