mirror of
https://github.com/bentoml/OpenLLM.git
synced 2026-06-11 09:59:20 -04:00
feat(models): command-r (#1005)
* feat(models): add support for command-r

  Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>

* feat(models): support command-r and remove deadcode and extensions

  Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>

* chore: update local.sh script

  Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>

---------

Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>
356
openllm-python/README.md
generated
@@ -101,24 +101,16 @@ OpenLLM currently supports the following models. By default, OpenLLM doesn't inc

 ### Quickstart

-
-
-> **Note:** Baichuan requires to install with:
-> ```bash
-> pip install "openllm[baichuan]"
-> ```
-
-
 Run the following command to quickly spin up a Baichuan server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start baichuan-inc/baichuan-7b
+openllm start baichuan-inc/baichuan-7b --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -145,24 +137,16 @@ You can specify any of the following Baichuan models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** ChatGLM requires to install with:
-> ```bash
-> pip install "openllm[chatglm]"
-> ```
-
-
 Run the following command to quickly spin up a ChatGLM server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start thudm/chatglm-6b
+openllm start thudm/chatglm-6b --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -186,29 +170,55 @@ You can specify any of the following ChatGLM models via `openllm start`:

 <details>

+<summary>Cohere</summary>
+
+
+### Quickstart
+
+Run the following command to quickly spin up a Cohere server:
+
+```bash
+openllm start CohereForAI/c4ai-command-r-plus --trust-remote-code
+```
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
+```
+
+
+> **Note:** Any Cohere variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=commandr) to see more Cohere-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following Cohere models via `openllm start`:
+
+
+- [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus)
+- [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01)
+
+</details>
+
+<details>
+
 <summary>Dbrx</summary>


 ### Quickstart

-
-
-> **Note:** Dbrx requires to install with:
-> ```bash
-> pip install "openllm[dbrx]"
-> ```
-
-
 Run the following command to quickly spin up a Dbrx server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start databricks/dbrx-instruct
+openllm start databricks/dbrx-instruct --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -236,13 +246,13 @@ You can specify any of the following Dbrx models via `openllm start`:
 Run the following command to quickly spin up a DollyV2 server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start databricks/dolly-v2-3b
+openllm start databricks/dolly-v2-3b --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -268,24 +278,16 @@ You can specify any of the following DollyV2 models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Falcon requires to install with:
-> ```bash
-> pip install "openllm[falcon]"
-> ```
-
-
 Run the following command to quickly spin up a Falcon server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start tiiuae/falcon-7b
+openllm start tiiuae/falcon-7b --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -312,24 +314,16 @@ You can specify any of the following Falcon models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Gemma requires to install with:
-> ```bash
-> pip install "openllm[gemma]"
-> ```
-
-
 Run the following command to quickly spin up a Gemma server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start google/gemma-7b
+openllm start google/gemma-7b --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -359,13 +353,13 @@ You can specify any of the following Gemma models via `openllm start`:
 Run the following command to quickly spin up a GPTNeoX server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start eleutherai/gpt-neox-20b
+openllm start eleutherai/gpt-neox-20b --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -389,24 +383,16 @@ You can specify any of the following GPTNeoX models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Llama requires to install with:
-> ```bash
-> pip install "openllm[llama]"
-> ```
-
-
 Run the following command to quickly spin up a Llama server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start NousResearch/llama-2-7b-hf
+openllm start NousResearch/llama-2-7b-hf --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -441,24 +427,16 @@ You can specify any of the following Llama models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Mistral requires to install with:
-> ```bash
-> pip install "openllm[mistral]"
-> ```
-
-
 Run the following command to quickly spin up a Mistral server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start mistralai/Mistral-7B-Instruct-v0.1
+openllm start mistralai/Mistral-7B-Instruct-v0.1 --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -486,24 +464,16 @@ You can specify any of the following Mistral models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Mixtral requires to install with:
-> ```bash
-> pip install "openllm[mixtral]"
-> ```
-
-
 Run the following command to quickly spin up a Mixtral server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start mistralai/Mixtral-8x7B-Instruct-v0.1
+openllm start mistralai/Mixtral-8x7B-Instruct-v0.1 --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -528,24 +498,16 @@ You can specify any of the following Mixtral models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** MPT requires to install with:
-> ```bash
-> pip install "openllm[mpt]"
-> ```
-
-
 Run the following command to quickly spin up a MPT server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start mosaicml/mpt-7b-instruct
+openllm start mosaicml/mpt-7b-instruct --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -575,24 +537,16 @@ You can specify any of the following MPT models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** OPT requires to install with:
-> ```bash
-> pip install "openllm[opt]"
-> ```
-
-
 Run the following command to quickly spin up a OPT server:

 ```bash
 openllm start facebook/opt-1.3b
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -621,24 +575,16 @@ You can specify any of the following OPT models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Phi requires to install with:
-> ```bash
-> pip install "openllm[phi]"
-> ```
-
-
 Run the following command to quickly spin up a Phi server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start microsoft/Phi-3-mini-4k-instruct
+openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -667,24 +613,16 @@ You can specify any of the following Phi models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Qwen requires to install with:
-> ```bash
-> pip install "openllm[qwen]"
-> ```
-
-
 Run the following command to quickly spin up a Qwen server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start qwen/Qwen-7B-Chat
+openllm start qwen/Qwen-7B-Chat --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -713,24 +651,16 @@ You can specify any of the following Qwen models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** StableLM requires to install with:
-> ```bash
-> pip install "openllm[stablelm]"
-> ```
-
-
 Run the following command to quickly spin up a StableLM server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start stabilityai/stablelm-tuned-alpha-3b
+openllm start stabilityai/stablelm-tuned-alpha-3b --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -757,24 +687,16 @@ You can specify any of the following StableLM models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** StarCoder requires to install with:
-> ```bash
-> pip install "openllm[starcoder]"
-> ```
-
-
 Run the following command to quickly spin up a StarCoder server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start bigcode/starcoder
+openllm start bigcode/starcoder --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -799,24 +721,16 @@ You can specify any of the following StarCoder models via `openllm start`:

 ### Quickstart

-
-
-> **Note:** Yi requires to install with:
-> ```bash
-> pip install "openllm[yi]"
-> ```
-
-
 Run the following command to quickly spin up a Yi server:

 ```bash
-TRUST_REMOTE_CODE=True openllm start 01-ai/Yi-6B
+openllm start 01-ai/Yi-6B --trust-remote-code
 ```
-In a different terminal, run the following command to interact with the server:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
+You can run the following code in a different terminal to interact with the server:
+```python
+import openllm_client
+client = openllm_client.HTTPClient('http://localhost:3000')
+client.generate('What are large language models?')
 ```


@@ -39,21 +39,16 @@ classifiers = [
]
dependencies = [
  "bentoml[io]>=1.2.16",
  "transformers[torch,tokenizers]>=4.36.0",
  "openllm-client>=0.5.4",
  "openllm-core>=0.5.4",
  "safetensors",
  "vllm>=0.4.2",
  "optimum>=1.12.0",
  "accelerate",
  "ghapi",
  "einops",
  "sentencepiece",
  "scipy",
  "build[virtualenv]<1",
  "click>=8.1.3",
  "cuda-python;platform_system!=\"Darwin\"",
  "bitsandbytes<0.42",
]
description = "OpenLLM: Run any open-source LLMs, such as Llama 2, Mistral, as OpenAI compatible API endpoint in the cloud."
dynamic = ["version", "readme"]
@@ -94,38 +89,6 @@ Homepage = "https://bentoml.com"
 Tracker = "https://github.com/bentoml/OpenLLM/issues"
 Twitter = "https://twitter.com/bentomlai"

-[project.optional-dependencies]
-agents = ["transformers[agents]>=4.36.0", "diffusers", "soundfile"]
-all = ["openllm[full]"]
-awq = ["autoawq"]
-baichuan = ["cpm-kernels"]
-chatglm = ["cpm-kernels"]
-dbrx = ["cpm-kernels"]
-dolly-v2 = ["cpm-kernels"]
-falcon = ["xformers"]
-fine-tune = ["peft>=0.6.0", "datasets", "trl", "huggingface-hub"]
-full = [
-  "openllm[agents,awq,baichuan,chatglm,dbrx,dolly-v2,falcon,fine-tune,gemma,ggml,gpt-neox,gptq,grpc,llama,mistral,mixtral,mpt,openai,opt,phi,playground,qwen,stablelm,starcoder,vllm,yi]",
-]
-gemma = ["xformers"]
-ggml = ["ctransformers"]
-gpt-neox = ["xformers"]
-gptq = ["auto-gptq[triton]>=0.4.2"]
-grpc = ["bentoml[grpc]>=1.2.16"]
-llama = ["xformers"]
-mistral = ["xformers"]
-mixtral = ["xformers"]
-mpt = ["triton"]
-openai = ["openai[datalib]>=1", "tiktoken", "fastapi"]
-opt = ["triton"]
-phi = ["triton"]
-playground = ["jupyter", "notebook", "ipython", "jupytext", "nbformat"]
-qwen = ["cpm-kernels", "tiktoken"]
-stablelm = ["cpm-kernels", "tiktoken"]
-starcoder = ["bitsandbytes"]
-vllm = ["vllm==0.4.2"]
-yi = ["bitsandbytes"]
-
 [tool.hatch.version]
 fallback-version = "0.0.0"
 source = "vcs"
@@ -105,7 +105,7 @@ def optimization_decorator(fn: t.Callable[..., t.Any]):
       '--quantise',
       '--quantize',
       'quantise',
-      type=str,
+      type=click.Choice(get_literal_args(LiteralQuantise)),
       default=None,
       envvar='QUANTIZE',
       show_envvar=True,
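The hunk above swaps a free-form `type=str` for a choice list derived from the `LiteralQuantise` type, so the set of accepted values lives in one place. A minimal sketch of that idea using only the standard library follows. It assumes `get_literal_args` behaves like `typing.get_args` (its implementation is not shown in this diff), and the `Quantise` alias, its members, and both helper names are hypothetical stand-ins for illustration.

```python
import typing as t

# Hypothetical stand-in for openllm_core's LiteralQuantise; the real members may differ.
Quantise = t.Literal['gptq', 'awq', 'squeezellm']


def literal_choices(literal_type: t.Any) -> t.Tuple[str, ...]:
  """Return the values allowed by a typing.Literal alias (assumed analogue of get_literal_args)."""
  return t.get_args(literal_type)


def check_quantise(value: t.Optional[str]) -> t.Optional[str]:
  """Accept None or any member of the Literal; reject everything else."""
  if value is not None and value not in literal_choices(Quantise):
    raise ValueError(f'Invalid quantization method: {value}')
  return value
```

With this shape, adding a member to the `Literal` updates both the type checker's view and the runtime validation, which is presumably why the hard-coded set was dropped.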
@@ -1,6 +1,6 @@
 from __future__ import annotations

-import inspect, orjson, dataclasses, bentoml, functools, attr, openllm_core, traceback, openllm, typing as t
+import inspect, orjson, logging, dataclasses, bentoml, functools, attr, os, openllm_core, traceback, openllm, typing as t

 from openllm_core.utils import (
   get_debug_mode,
@@ -10,11 +10,13 @@ from openllm_core.utils import (
   dict_filter_none,
   Counter,
 )
-from openllm_core._typing_compat import LiteralQuantise, LiteralSerialisation, LiteralDtype
+from openllm_core._typing_compat import LiteralQuantise, LiteralSerialisation, LiteralDtype, get_literal_args
 from openllm_core._schemas import GenerationOutput

 Dtype = t.Union[LiteralDtype, t.Literal['auto', 'half', 'float']]

+logger = logging.getLogger(__name__)
+
 if t.TYPE_CHECKING:
   from vllm import AsyncEngineArgs, EngineArgs, RequestOutput

@@ -30,11 +32,24 @@ def check_engine_args(_, attr: attr.Attribute[dict[str, t.Any]], v: dict[str, t.


 def check_quantization(_, attr: attr.Attribute[LiteralQuantise], v: str | None) -> LiteralQuantise | None:
-  if v is not None and v not in {'gptq', 'awq', 'squeezellm'}:
+  if v is not None and v not in get_literal_args(LiteralQuantise):
     raise ValueError(f'Invalid quantization method: {v}')
   return v


+def update_engine_args(v: t.Dict[str, t.Any]) -> t.Dict[str, t.Any]:
+  env_json_string = os.environ.get('ENGINE_CONFIG', None)
+
+  config_from_env = {}
+  if env_json_string is not None:
+    try:
+      config_from_env = orjson.loads(env_json_string)
+    except orjson.JSONDecodeError as e:
+      raise RuntimeError("Failed to parse 'ENGINE_CONFIG' as valid JSON string.") from e
+  config_from_env.update(v)
+  return config_from_env
+
+
 @attr.define(init=False)
 class LLM:
   model_id: str
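The new `update_engine_args` converter reads a JSON object from the `ENGINE_CONFIG` environment variable and then layers the explicitly passed arguments on top, so explicit values win over the environment. The sketch below reproduces that precedence with the standard-library `json` module swapped in for `orjson`; the function name `merge_engine_args` and the keys in the usage example are illustrative assumptions, not part of the commit.

```python
import json
import os
import typing as t


def merge_engine_args(explicit: t.Dict[str, t.Any], env_var: str = 'ENGINE_CONFIG') -> t.Dict[str, t.Any]:
  """Start from the JSON object in `env_var` (if set), then let explicit values override it."""
  raw = os.environ.get(env_var)
  merged: t.Dict[str, t.Any] = {}
  if raw is not None:
    try:
      merged = json.loads(raw)
    except json.JSONDecodeError as e:
      raise RuntimeError(f'Failed to parse {env_var!r} as valid JSON string.') from e
  merged.update(explicit)  # explicit arguments take precedence over the environment
  return merged


# Usage: the environment supplies defaults, the caller overrides one of them.
os.environ['ENGINE_CONFIG'] = '{"max_model_len": 4096, "gpu_memory_utilization": 0.8}'
print(merge_engine_args({'gpu_memory_utilization': 0.9}))
# {'max_model_len': 4096, 'gpu_memory_utilization': 0.9}
```

Failing loudly on malformed JSON, as the diff does, seems preferable to silently ignoring the variable, since a typo would otherwise leave the engine running with unintended defaults.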
@@ -44,7 +59,7 @@ class LLM:
   dtype: Dtype
   quantise: t.Optional[LiteralQuantise] = attr.field(default=None, validator=check_quantization)
   trust_remote_code: bool = attr.field(default=False)
-  engine_args: t.Dict[str, t.Any] = attr.field(factory=dict, validator=check_engine_args)
+  engine_args: t.Dict[str, t.Any] = attr.field(factory=dict, validator=check_engine_args, converter=update_engine_args)

   _mode: t.Literal['batch', 'async'] = attr.field(default='async', repr=False)
   _path: str = attr.field(
@@ -117,18 +132,27 @@ class LLM:
     num_gpus, dev = 1, openllm.utils.device_count()
     if dev >= 2:
       num_gpus = min(dev // 2 * 2, dev)
-    dtype = 'float16' if self.quantise == 'gptq' else self.dtype # NOTE: quantise GPTQ doesn't support bfloat16 yet.

-    self.engine_args.update({
-      'worker_use_ray': False,
-      'tokenizer_mode': 'auto',
+    overriden_dict = {
       'tensor_parallel_size': num_gpus,
       'model': self._path,
       'tokenizer': self._path,
       'trust_remote_code': self.trust_remote_code,
-      'dtype': dtype,
+      'dtype': self.dtype,
       'quantization': self.quantise,
-    })
+    }
+    if any(k in self.engine_args for k in overriden_dict.keys()):
+      logger.warning(
+        'The following key will be overriden by openllm: %s (got %s set)',
+        list(overriden_dict),
+        [k for k in overriden_dict if k in self.engine_args],
+      )
+
+    self.engine_args.update(overriden_dict)
+    if 'worker_use_ray' not in self.engine_args:
+      self.engine_args['worker_use_ray'] = False
+    if 'tokenizer_mode' not in self.engine_args:
+      self.engine_args['tokenizer_mode'] = 'auto'
     if 'disable_log_stats' not in self.engine_args:
       self.engine_args['disable_log_stats'] = not get_debug_mode()
     if 'gpu_memory_utilization' not in self.engine_args:
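The hunk above establishes three tiers for engine arguments: keys that OpenLLM always forces (with a warning when the user also set them), keys the user supplied, and defaults that apply only when a key is absent. A simplified sketch of that layering, under the assumption that plain dictionaries are enough to show it; `apply_overrides` and its parameter names are hypothetical, and the warning text is shortened relative to the diff.

```python
import logging
import typing as t

logger = logging.getLogger(__name__)


def apply_overrides(
  user_args: t.Dict[str, t.Any], forced: t.Dict[str, t.Any], defaults: t.Dict[str, t.Any]
) -> t.Dict[str, t.Any]:
  """Layer engine arguments: forced values win, then user values, then defaults."""
  clashes = [k for k in forced if k in user_args]
  if clashes:
    # Mirrors the warning in the diff: user-provided values for forced keys are discarded.
    logger.warning('The following keys will be overridden: %s', clashes)
  result = dict(user_args)
  result.update(forced)
  for key, value in defaults.items():
    result.setdefault(key, value)  # only fills keys the user left unset
  return result
```

For example, `apply_overrides({'dtype': 'bfloat16', 'tokenizer_mode': 'slow'}, {'dtype': 'float16'}, {'tokenizer_mode': 'auto', 'worker_use_ray': False})` keeps the user's `tokenizer_mode`, replaces `dtype`, and fills in `worker_use_ray`.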
@@ -13,7 +13,7 @@ Fine-tune, serve, deploy, and monitor any LLMs with ease.
 # fmt: off
 # update-config-stubs.py: import stubs start
 from openllm_client import AsyncHTTPClient as AsyncHTTPClient, HTTPClient as HTTPClient
-from openllm_core.config import CONFIG_MAPPING as CONFIG_MAPPING, CONFIG_MAPPING_NAMES as CONFIG_MAPPING_NAMES, AutoConfig as AutoConfig, BaichuanConfig as BaichuanConfig, ChatGLMConfig as ChatGLMConfig, DbrxConfig as DbrxConfig, DollyV2Config as DollyV2Config, FalconConfig as FalconConfig, GemmaConfig as GemmaConfig, GPTNeoXConfig as GPTNeoXConfig, LlamaConfig as LlamaConfig, MistralConfig as MistralConfig, MixtralConfig as MixtralConfig, MPTConfig as MPTConfig, OPTConfig as OPTConfig, PhiConfig as PhiConfig, QwenConfig as QwenConfig, StableLMConfig as StableLMConfig, StarCoderConfig as StarCoderConfig, YiConfig as YiConfig
+from openllm_core.config import CONFIG_MAPPING as CONFIG_MAPPING, CONFIG_MAPPING_NAMES as CONFIG_MAPPING_NAMES, AutoConfig as AutoConfig, BaichuanConfig as BaichuanConfig, ChatGLMConfig as ChatGLMConfig, CohereConfig as CohereConfig, DbrxConfig as DbrxConfig, DollyV2Config as DollyV2Config, FalconConfig as FalconConfig, GemmaConfig as GemmaConfig, GPTNeoXConfig as GPTNeoXConfig, LlamaConfig as LlamaConfig, MistralConfig as MistralConfig, MixtralConfig as MixtralConfig, MPTConfig as MPTConfig, OPTConfig as OPTConfig, PhiConfig as PhiConfig, QwenConfig as QwenConfig, StableLMConfig as StableLMConfig, StarCoderConfig as StarCoderConfig, YiConfig as YiConfig
 from openllm_core._configuration import GenerationConfig as GenerationConfig, LLMConfig as LLMConfig
 from openllm_core._schemas import GenerationInput as GenerationInput, GenerationOutput as GenerationOutput, MetadataOutput as MetadataOutput, MessageParam as MessageParam
 from openllm_core.utils import api as api
@@ -6,7 +6,6 @@ from openllm_core.utils import (
   DEV_DEBUG_VAR as DEV_DEBUG_VAR,
   ENV_VARS_TRUE_VALUES as ENV_VARS_TRUE_VALUES,
   MYPY as MYPY,
-  OPTIONAL_DEPENDENCIES as OPTIONAL_DEPENDENCIES,
   QUIET_ENV_VAR as QUIET_ENV_VAR,
   SHOW_CODEGEN as SHOW_CODEGEN,
   LazyLoader as LazyLoader,