feat: disable force eviction (#7725)

* feat: allow forcing backend eviction while requests are in flight

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: try to make the request sit and retry if eviction couldn't be done

Otherwise, calls that can only succeed by shutting down other
backends would simply fail.

Instead, the request now waits and retries eviction until it
succeeds. The thresholds can be configured by the user.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* add tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* expose settings to CLI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2025-12-25 14:26:18 +01:00
committed by GitHub
parent bb459e671f
commit c844b7ac58
18 changed files with 739 additions and 41 deletions


@@ -52,6 +52,49 @@ Setting the limit to `1` is equivalent to single active backend mode (see below)
3. The LRU model(s) are automatically unloaded to make room for the new model
4. Concurrent requests for loading different models are handled safely - the system accounts for models currently being loaded when calculating evictions
### Eviction Behavior with Active Requests
By default, LocalAI will **skip evicting models that have active API calls** to prevent interrupting ongoing requests. This means:
- If all models are busy (have active requests), eviction will be skipped and the system will wait for models to become idle
- The loading request will retry eviction with configurable retry settings
- This keeps in-flight requests from being interrupted or failing
You can configure this behavior via the WebUI or with the following settings:
#### Force Eviction When Busy
To allow evicting models even when they have active API calls (not recommended for production):
```bash
# Via CLI
./local-ai --force-eviction-when-busy
# Via environment variable
LOCALAI_FORCE_EVICTION_WHEN_BUSY=true ./local-ai
```
> **Warning:** Enabling force eviction can interrupt active requests and cause errors. Only use this if you understand the implications.
#### LRU Eviction Retry Settings
When models are busy and cannot be evicted, LocalAI will retry eviction with configurable settings:
```bash
# Configure maximum retries (default: 30)
./local-ai --lru-eviction-max-retries=50
# Configure retry interval (default: 1s)
./local-ai --lru-eviction-retry-interval=2s
# Using environment variables
LOCALAI_LRU_EVICTION_MAX_RETRIES=50 \
LOCALAI_LRU_EVICTION_RETRY_INTERVAL=2s \
./local-ai
```
These settings control how long the system waits for busy models to become idle before giving up. Retrying gives busy models time to finish their requests before they are evicted, which avoids request failures.
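As a rough guide, the worst-case wait can be estimated as the retry count multiplied by the retry interval. The sketch below assumes that this simple product holds (LocalAI's actual timing may differ slightly); `worst_case_wait_seconds` is a hypothetical helper for illustration, not a LocalAI command.

```bash
#!/usr/bin/env bash
# Rough upper bound on how long a load request may wait for busy models.
# Assumption: total wait ~= max retries x retry interval (in whole seconds).

# worst_case_wait_seconds MAX_RETRIES INTERVAL_SECONDS
# Hypothetical helper, not part of LocalAI.
worst_case_wait_seconds() {
  local max_retries="$1"
  local interval_seconds="$2"
  echo $(( max_retries * interval_seconds ))
}

worst_case_wait_seconds 30 1   # defaults (30 retries, 1s): prints 30
worst_case_wait_seconds 50 2   # custom values used above: prints 100
```

Under this assumption, the defaults allow roughly 30 seconds of waiting, and the custom values shown above allow roughly 100 seconds.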
### Example
```bash
@@ -207,6 +250,33 @@ This configuration:
- Automatically unloads any model that hasn't been used for 15 minutes
- Provides both hard limits and time-based cleanup
### Example with Retry Settings
You can also configure retry behavior when models are busy:
```bash
# Allow up to 2 active backends with custom retry settings
LOCALAI_MAX_ACTIVE_BACKENDS=2 \
LOCALAI_LRU_EVICTION_MAX_RETRIES=50 \
LOCALAI_LRU_EVICTION_RETRY_INTERVAL=2s \
./local-ai
```
Or using command line flags:
```bash
./local-ai \
--max-active-backends=2 \
--lru-eviction-max-retries=50 \
--lru-eviction-retry-interval=2s
```
This configuration:
- Limits LocalAI to 2 active backends
- Retries eviction up to 50 times if models are busy
- Waits 2 seconds between retry attempts
- Gives busy models time to complete their requests before eviction
## Limitations and Considerations
### VRAM Usage Estimation


@@ -29,9 +29,23 @@ Changes to watchdog settings are applied immediately by restarting the watchdog
- **Max Active Backends**: Maximum number of active backends (loaded models). When exceeded, the least recently used model is automatically evicted. Set to `0` for unlimited, `1` for single-backend mode
- **Parallel Backend Requests**: Enable backends to handle multiple requests in parallel if supported
- **Force Eviction When Busy**: Allow evicting models even when they have active API calls (default: disabled for safety). **Warning:** Enabling this can interrupt active requests
- **LRU Eviction Max Retries**: Maximum number of retries when waiting for busy models to become idle before eviction (default: 30)
- **LRU Eviction Retry Interval**: Interval between retries when waiting for busy models (default: `1s`)
> **Note:** The "Single Backend" setting is deprecated. Use "Max Active Backends" set to `1` for single-backend behavior.
#### LRU Eviction Behavior
By default, LocalAI will skip evicting models that have active API calls to prevent interrupting ongoing requests. When all models are busy and eviction is needed:
1. The system will wait for models to become idle
2. It will retry eviction up to the configured maximum number of retries
3. The retry interval determines how long to wait between attempts
4. If all retries are exhausted, the system will proceed (which may cause out-of-memory errors if resources are truly exhausted)
You can configure these settings via the web UI or through environment variables. See [VRAM Management]({{%relref "advanced/vram-management" %}}) for more details.
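The retry behavior described above can be sketched as a small shell loop. This is an illustration only, not LocalAI's implementation: `try_evict` and `evict_with_retry` are hypothetical names, and the "model becomes idle after a few checks" logic is a stand-in for real backend state.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the documented retry behavior (not LocalAI code).

# Hypothetical stand-in for "evict the LRU model if it is idle".
# The simulated model stays busy for the first BUSY_ATTEMPTS checks.
BUSY_ATTEMPTS=3
attempts_made=0
try_evict() {
  attempts_made=$(( attempts_made + 1 ))
  [ "$attempts_made" -gt "$BUSY_ATTEMPTS" ]
}

# evict_with_retry MAX_RETRIES INTERVAL
# Returns 0 if eviction succeeded, 1 if all retries were exhausted.
evict_with_retry() {
  local max_retries="$1" interval="$2" retry=0
  until try_evict; do
    if [ "$retry" -ge "$max_retries" ]; then
      return 1   # retries exhausted; the caller proceeds anyway
    fi
    retry=$(( retry + 1 ))
    sleep "$interval"   # wait before the next attempt
  done
  return 0
}

# An interval of 0 keeps the demo fast; the documented default is 1s.
if evict_with_retry 30 0; then
  echo "evicted after $attempts_made attempts"
else
  echo "retries exhausted, proceeding without eviction"
fi
```

In this sketch, a maximum of N retries means one initial attempt plus up to N further attempts. Whether LocalAI counts attempts the same way is an assumption.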
### Performance Settings
- **Threads**: Number of threads used for parallel computation (recommended: number of physical cores)
@@ -94,6 +108,9 @@ The `runtime_settings.json` file follows this structure:
"watchdog_busy_timeout": "5m",
"max_active_backends": 0,
"parallel_backend_requests": true,
"force_eviction_when_busy": false,
"lru_eviction_max_retries": 30,
"lru_eviction_retry_interval": "1s",
"threads": 8,
"context_size": 2048,
"f16": false,


@@ -128,7 +128,7 @@ Future versions of LocalAI will expose additional control over audio generation
#### Setup
Install the `vibevoice` model in the Model gallery.
Install the `vibevoice` model in the Model gallery or run `local-ai run models install vibevoice`.
#### Usage


@@ -46,6 +46,9 @@ Complete reference for all LocalAI command-line interface (CLI) parameters and e
| `--watchdog-idle-timeout` | `15m` | Threshold beyond which an idle backend should be stopped | `$LOCALAI_WATCHDOG_IDLE_TIMEOUT`, `$WATCHDOG_IDLE_TIMEOUT` |
| `--enable-watchdog-busy` | `false` | Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout | `$LOCALAI_WATCHDOG_BUSY`, `$WATCHDOG_BUSY` |
| `--watchdog-busy-timeout` | `5m` | Threshold beyond which a busy backend should be stopped | `$LOCALAI_WATCHDOG_BUSY_TIMEOUT`, `$WATCHDOG_BUSY_TIMEOUT` |
| `--force-eviction-when-busy` | `false` | Force eviction even when models have active API calls. Disabled by default for safety. **Warning:** Enabling this can interrupt active requests | `$LOCALAI_FORCE_EVICTION_WHEN_BUSY`, `$FORCE_EVICTION_WHEN_BUSY` |
| `--lru-eviction-max-retries` | `30` | Maximum number of retries when waiting for busy models to become idle before eviction | `$LOCALAI_LRU_EVICTION_MAX_RETRIES`, `$LRU_EVICTION_MAX_RETRIES` |
| `--lru-eviction-retry-interval` | `1s` | Interval between retries when waiting for busy models to become idle (e.g., `1s`, `2s`) | `$LOCALAI_LRU_EVICTION_RETRY_INTERVAL`, `$LRU_EVICTION_RETRY_INTERVAL` |
For more information on VRAM management, see [VRAM and Memory Management]({{%relref "advanced/vram-management" %}}).