feat(models): command-r (#1005)

* feat(models): add support for command-r

Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>

* feat(models): support command-r and remove deadcode and extensions

Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>

* chore: update local.sh script

Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>

---------

Signed-off-by: paperspace <29749331+aarnphm@users.noreply.github.com>
Aaron Pham
2024-06-02 10:16:08 -04:00
committed by GitHub
parent 9649073713
commit bf28f977bc
28 changed files with 628 additions and 923 deletions

356
openllm-python/README.md generated

@@ -101,24 +101,16 @@ OpenLLM currently supports the following models. By default, OpenLLM doesn't inc
### Quickstart
> **Note:** Baichuan requires to install with:
> ```bash
> pip install "openllm[baichuan]"
> ```
Run the following command to quickly spin up a Baichuan server:
```bash
TRUST_REMOTE_CODE=True openllm start baichuan-inc/baichuan-7b
openllm start baichuan-inc/baichuan-7b --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -145,24 +137,16 @@ You can specify any of the following Baichuan models via `openllm start`:
### Quickstart
> **Note:** ChatGLM requires to install with:
> ```bash
> pip install "openllm[chatglm]"
> ```
Run the following command to quickly spin up a ChatGLM server:
```bash
TRUST_REMOTE_CODE=True openllm start thudm/chatglm-6b
openllm start thudm/chatglm-6b --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -186,29 +170,55 @@ You can specify any of the following ChatGLM models via `openllm start`:
<details>
<summary>Cohere</summary>
### Quickstart
Run the following command to quickly spin up a Cohere server:
```bash
openllm start CohereForAI/c4ai-command-r-plus --trust-remote-code
```
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
> **Note:** Any Cohere variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=commandr) to see more Cohere-compatible models.
### Supported models
You can specify any of the following Cohere models via `openllm start`:
- [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus)
- [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01)
</details>
<details>
<summary>Dbrx</summary>
### Quickstart
> **Note:** Dbrx requires to install with:
> ```bash
> pip install "openllm[dbrx]"
> ```
Run the following command to quickly spin up a Dbrx server:
```bash
TRUST_REMOTE_CODE=True openllm start databricks/dbrx-instruct
openllm start databricks/dbrx-instruct --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -236,13 +246,13 @@ You can specify any of the following Dbrx models via `openllm start`:
Run the following command to quickly spin up a DollyV2 server:
```bash
TRUST_REMOTE_CODE=True openllm start databricks/dolly-v2-3b
openllm start databricks/dolly-v2-3b --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -268,24 +278,16 @@ You can specify any of the following DollyV2 models via `openllm start`:
### Quickstart
> **Note:** Falcon requires to install with:
> ```bash
> pip install "openllm[falcon]"
> ```
Run the following command to quickly spin up a Falcon server:
```bash
TRUST_REMOTE_CODE=True openllm start tiiuae/falcon-7b
openllm start tiiuae/falcon-7b --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -312,24 +314,16 @@ You can specify any of the following Falcon models via `openllm start`:
### Quickstart
> **Note:** Gemma requires to install with:
> ```bash
> pip install "openllm[gemma]"
> ```
Run the following command to quickly spin up a Gemma server:
```bash
TRUST_REMOTE_CODE=True openllm start google/gemma-7b
openllm start google/gemma-7b --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -359,13 +353,13 @@ You can specify any of the following Gemma models via `openllm start`:
Run the following command to quickly spin up a GPTNeoX server:
```bash
TRUST_REMOTE_CODE=True openllm start eleutherai/gpt-neox-20b
openllm start eleutherai/gpt-neox-20b --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -389,24 +383,16 @@ You can specify any of the following GPTNeoX models via `openllm start`:
### Quickstart
> **Note:** Llama requires to install with:
> ```bash
> pip install "openllm[llama]"
> ```
Run the following command to quickly spin up a Llama server:
```bash
TRUST_REMOTE_CODE=True openllm start NousResearch/llama-2-7b-hf
openllm start NousResearch/llama-2-7b-hf --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -441,24 +427,16 @@ You can specify any of the following Llama models via `openllm start`:
### Quickstart
> **Note:** Mistral requires to install with:
> ```bash
> pip install "openllm[mistral]"
> ```
Run the following command to quickly spin up a Mistral server:
```bash
TRUST_REMOTE_CODE=True openllm start mistralai/Mistral-7B-Instruct-v0.1
openllm start mistralai/Mistral-7B-Instruct-v0.1 --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -486,24 +464,16 @@ You can specify any of the following Mistral models via `openllm start`:
### Quickstart
> **Note:** Mixtral requires to install with:
> ```bash
> pip install "openllm[mixtral]"
> ```
Run the following command to quickly spin up a Mixtral server:
```bash
TRUST_REMOTE_CODE=True openllm start mistralai/Mixtral-8x7B-Instruct-v0.1
openllm start mistralai/Mixtral-8x7B-Instruct-v0.1 --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -528,24 +498,16 @@ You can specify any of the following Mixtral models via `openllm start`:
### Quickstart
> **Note:** MPT requires to install with:
> ```bash
> pip install "openllm[mpt]"
> ```
Run the following command to quickly spin up a MPT server:
```bash
TRUST_REMOTE_CODE=True openllm start mosaicml/mpt-7b-instruct
openllm start mosaicml/mpt-7b-instruct --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -575,24 +537,16 @@ You can specify any of the following MPT models via `openllm start`:
### Quickstart
> **Note:** OPT requires to install with:
> ```bash
> pip install "openllm[opt]"
> ```
Run the following command to quickly spin up a OPT server:
```bash
openllm start facebook/opt-1.3b
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -621,24 +575,16 @@ You can specify any of the following OPT models via `openllm start`:
### Quickstart
> **Note:** Phi requires to install with:
> ```bash
> pip install "openllm[phi]"
> ```
Run the following command to quickly spin up a Phi server:
```bash
TRUST_REMOTE_CODE=True openllm start microsoft/Phi-3-mini-4k-instruct
openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -667,24 +613,16 @@ You can specify any of the following Phi models via `openllm start`:
### Quickstart
> **Note:** Qwen requires to install with:
> ```bash
> pip install "openllm[qwen]"
> ```
Run the following command to quickly spin up a Qwen server:
```bash
TRUST_REMOTE_CODE=True openllm start qwen/Qwen-7B-Chat
openllm start qwen/Qwen-7B-Chat --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -713,24 +651,16 @@ You can specify any of the following Qwen models via `openllm start`:
### Quickstart
> **Note:** StableLM requires to install with:
> ```bash
> pip install "openllm[stablelm]"
> ```
Run the following command to quickly spin up a StableLM server:
```bash
TRUST_REMOTE_CODE=True openllm start stabilityai/stablelm-tuned-alpha-3b
openllm start stabilityai/stablelm-tuned-alpha-3b --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -757,24 +687,16 @@ You can specify any of the following StableLM models via `openllm start`:
### Quickstart
> **Note:** StarCoder requires to install with:
> ```bash
> pip install "openllm[starcoder]"
> ```
Run the following command to quickly spin up a StarCoder server:
```bash
TRUST_REMOTE_CODE=True openllm start bigcode/starcoder
openllm start bigcode/starcoder --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```
@@ -799,24 +721,16 @@ You can specify any of the following StarCoder models via `openllm start`:
### Quickstart
> **Note:** Yi requires to install with:
> ```bash
> pip install "openllm[yi]"
> ```
Run the following command to quickly spin up a Yi server:
```bash
TRUST_REMOTE_CODE=True openllm start 01-ai/Yi-6B
openllm start 01-ai/Yi-6B --trust-remote-code
```
In a different terminal, run the following command to interact with the server:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
You can run the following code in a different terminal to interact with the server:
```python
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```


@@ -39,21 +39,16 @@ classifiers = [
]
dependencies = [
"bentoml[io]>=1.2.16",
"transformers[torch,tokenizers]>=4.36.0",
"openllm-client>=0.5.4",
"openllm-core>=0.5.4",
"safetensors",
"vllm>=0.4.2",
"optimum>=1.12.0",
"accelerate",
"ghapi",
"einops",
"sentencepiece",
"scipy",
"build[virtualenv]<1",
"click>=8.1.3",
"cuda-python;platform_system!=\"Darwin\"",
"bitsandbytes<0.42",
]
description = "OpenLLM: Run any open-source LLMs, such as Llama 2, Mistral, as OpenAI compatible API endpoint in the cloud."
dynamic = ["version", "readme"]
@@ -94,38 +89,6 @@ Homepage = "https://bentoml.com"
Tracker = "https://github.com/bentoml/OpenLLM/issues"
Twitter = "https://twitter.com/bentomlai"
[project.optional-dependencies]
agents = ["transformers[agents]>=4.36.0", "diffusers", "soundfile"]
all = ["openllm[full]"]
awq = ["autoawq"]
baichuan = ["cpm-kernels"]
chatglm = ["cpm-kernels"]
dbrx = ["cpm-kernels"]
dolly-v2 = ["cpm-kernels"]
falcon = ["xformers"]
fine-tune = ["peft>=0.6.0", "datasets", "trl", "huggingface-hub"]
full = [
"openllm[agents,awq,baichuan,chatglm,dbrx,dolly-v2,falcon,fine-tune,gemma,ggml,gpt-neox,gptq,grpc,llama,mistral,mixtral,mpt,openai,opt,phi,playground,qwen,stablelm,starcoder,vllm,yi]",
]
gemma = ["xformers"]
ggml = ["ctransformers"]
gpt-neox = ["xformers"]
gptq = ["auto-gptq[triton]>=0.4.2"]
grpc = ["bentoml[grpc]>=1.2.16"]
llama = ["xformers"]
mistral = ["xformers"]
mixtral = ["xformers"]
mpt = ["triton"]
openai = ["openai[datalib]>=1", "tiktoken", "fastapi"]
opt = ["triton"]
phi = ["triton"]
playground = ["jupyter", "notebook", "ipython", "jupytext", "nbformat"]
qwen = ["cpm-kernels", "tiktoken"]
stablelm = ["cpm-kernels", "tiktoken"]
starcoder = ["bitsandbytes"]
vllm = ["vllm==0.4.2"]
yi = ["bitsandbytes"]
[tool.hatch.version]
fallback-version = "0.0.0"
source = "vcs"


@@ -105,7 +105,7 @@ def optimization_decorator(fn: t.Callable[..., t.Any]):
'--quantise',
'--quantize',
'quantise',
type=str,
type=click.Choice(get_literal_args(LiteralQuantise)),
default=None,
envvar='QUANTIZE',
show_envvar=True,
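
The hunk above replaces a free-form `type=str` with a fixed choice list derived from the `LiteralQuantise` alias. A minimal sketch of how such a helper could work, assuming it simply unpacks a `typing.Literal`; the alias members below are a hypothetical stand-in and the real `openllm_core` definitions may differ:

```python
import typing as t

# Hypothetical stand-in for openllm_core's LiteralQuantise;
# the real alias may list different quantisation methods.
LiteralQuantise = t.Literal['gptq', 'awq', 'squeezellm']


def get_literal_args(literal_type: t.Any) -> t.Tuple[t.Any, ...]:
  """Return the members of a typing.Literal alias as a tuple."""
  return t.get_args(literal_type)


choices = get_literal_args(LiteralQuantise)
print(choices)
```

Deriving the choices from the type alias keeps the CLI option and the validator in sync: adding a method to the alias updates both without touching either call site.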


@@ -1,6 +1,6 @@
from __future__ import annotations
import inspect, orjson, dataclasses, bentoml, functools, attr, openllm_core, traceback, openllm, typing as t
import inspect, orjson, logging, dataclasses, bentoml, functools, attr, os, openllm_core, traceback, openllm, typing as t
from openllm_core.utils import (
get_debug_mode,
@@ -10,11 +10,13 @@ from openllm_core.utils import (
dict_filter_none,
Counter,
)
from openllm_core._typing_compat import LiteralQuantise, LiteralSerialisation, LiteralDtype
from openllm_core._typing_compat import LiteralQuantise, LiteralSerialisation, LiteralDtype, get_literal_args
from openllm_core._schemas import GenerationOutput
Dtype = t.Union[LiteralDtype, t.Literal['auto', 'half', 'float']]
logger = logging.getLogger(__name__)
if t.TYPE_CHECKING:
from vllm import AsyncEngineArgs, EngineArgs, RequestOutput
@@ -30,11 +32,24 @@ def check_engine_args(_, attr: attr.Attribute[dict[str, t.Any]], v: dict[str, t.
def check_quantization(_, attr: attr.Attribute[LiteralQuantise], v: str | None) -> LiteralQuantise | None:
if v is not None and v not in {'gptq', 'awq', 'squeezellm'}:
if v is not None and v not in get_literal_args(LiteralQuantise):
raise ValueError(f'Invalid quantization method: {v}')
return v
def update_engine_args(v: t.Dict[str, t.Any]) -> t.Dict[str, t.Any]:
env_json_string = os.environ.get('ENGINE_CONFIG', None)
config_from_env = {}
if env_json_string is not None:
try:
config_from_env = orjson.loads(env_json_string)
except orjson.JSONDecodeError as e:
raise RuntimeError("Failed to parse 'ENGINE_CONFIG' as valid JSON string.") from e
config_from_env.update(v)
return config_from_env
@attr.define(init=False)
class LLM:
model_id: str
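
The `update_engine_args` converter in this hunk merges a JSON object from the `ENGINE_CONFIG` environment variable with explicitly passed arguments, with the explicit values winning. A standalone sketch of the same merge order, using the standard-library `json` module in place of `orjson`; the function name `merge_engine_config` and the sample keys are illustrative assumptions, not part of the commit:

```python
import json
import os
import typing as t


def merge_engine_config(
  explicit: t.Dict[str, t.Any], env: t.Mapping[str, str] = os.environ
) -> t.Dict[str, t.Any]:
  """Start from the ENGINE_CONFIG JSON (if set), then let explicit values win."""
  raw = env.get('ENGINE_CONFIG')
  merged: t.Dict[str, t.Any] = {}
  if raw is not None:
    try:
      merged = json.loads(raw)
    except json.JSONDecodeError as e:
      raise RuntimeError("Failed to parse 'ENGINE_CONFIG' as valid JSON string.") from e
  merged.update(explicit)  # explicit arguments override environment values
  return merged


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {'ENGINE_CONFIG': '{"max_model_len": 4096, "gpu_memory_utilization": 0.8}'}
result = merge_engine_config({'gpu_memory_utilization': 0.9}, fake_env)
print(result)
```

Here `max_model_len` comes from the environment, while `gpu_memory_utilization` takes the explicitly passed value.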
@@ -44,7 +59,7 @@ class LLM:
dtype: Dtype
quantise: t.Optional[LiteralQuantise] = attr.field(default=None, validator=check_quantization)
trust_remote_code: bool = attr.field(default=False)
engine_args: t.Dict[str, t.Any] = attr.field(factory=dict, validator=check_engine_args)
engine_args: t.Dict[str, t.Any] = attr.field(factory=dict, validator=check_engine_args, converter=update_engine_args)
_mode: t.Literal['batch', 'async'] = attr.field(default='async', repr=False)
_path: str = attr.field(
@@ -117,18 +132,27 @@ class LLM:
num_gpus, dev = 1, openllm.utils.device_count()
if dev >= 2:
num_gpus = min(dev // 2 * 2, dev)
dtype = 'float16' if self.quantise == 'gptq' else self.dtype # NOTE: quantise GPTQ doesn't support bfloat16 yet.
self.engine_args.update({
'worker_use_ray': False,
'tokenizer_mode': 'auto',
overriden_dict = {
'tensor_parallel_size': num_gpus,
'model': self._path,
'tokenizer': self._path,
'trust_remote_code': self.trust_remote_code,
'dtype': dtype,
'dtype': self.dtype,
'quantization': self.quantise,
})
}
if any(k in self.engine_args for k in overriden_dict.keys()):
logger.warning(
'The following key will be overriden by openllm: %s (got %s set)',
list(overriden_dict),
[k for k in overriden_dict if k in self.engine_args],
)
self.engine_args.update(overriden_dict)
if 'worker_use_ray' not in self.engine_args:
self.engine_args['worker_use_ray'] = False
if 'tokenizer_mode' not in self.engine_args:
self.engine_args['tokenizer_mode'] = 'auto'
if 'disable_log_stats' not in self.engine_args:
self.engine_args['disable_log_stats'] = not get_debug_mode()
if 'gpu_memory_utilization' not in self.engine_args:
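
The hunk above separates two kinds of engine arguments: keys that OpenLLM always overrides (with a warning if the user had set them), and keys that only receive a default when absent. A minimal sketch of that precedence, assuming plain dictionaries; `resolve_engine_args` and the sample values are hypothetical names chosen for illustration:

```python
import logging
import typing as t

logger = logging.getLogger(__name__)


def resolve_engine_args(
  user_args: t.Dict[str, t.Any], forced: t.Dict[str, t.Any], defaults: t.Dict[str, t.Any]
) -> t.Dict[str, t.Any]:
  """Apply forced keys unconditionally, then fill defaults for missing keys only."""
  resolved = dict(user_args)
  clashes = [k for k in forced if k in resolved]
  if clashes:
    logger.warning('These user-set keys are overridden: %s', clashes)
  resolved.update(forced)  # forced keys always win
  for key, value in defaults.items():
    resolved.setdefault(key, value)  # user-provided values are preserved
  return resolved


resolved = resolve_engine_args(
  {'dtype': 'bfloat16', 'tokenizer_mode': 'slow'},
  forced={'dtype': 'float16', 'quantization': None},
  defaults={'worker_use_ray': False, 'tokenizer_mode': 'auto'},
)
print(resolved)
```

In this run the user's `dtype` is replaced and logged, while the user's `tokenizer_mode` survives because it is only a default.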


@@ -13,7 +13,7 @@ Fine-tune, serve, deploy, and monitor any LLMs with ease.
# fmt: off
# update-config-stubs.py: import stubs start
from openllm_client import AsyncHTTPClient as AsyncHTTPClient, HTTPClient as HTTPClient
from openllm_core.config import CONFIG_MAPPING as CONFIG_MAPPING, CONFIG_MAPPING_NAMES as CONFIG_MAPPING_NAMES, AutoConfig as AutoConfig, BaichuanConfig as BaichuanConfig, ChatGLMConfig as ChatGLMConfig, DbrxConfig as DbrxConfig, DollyV2Config as DollyV2Config, FalconConfig as FalconConfig, GemmaConfig as GemmaConfig, GPTNeoXConfig as GPTNeoXConfig, LlamaConfig as LlamaConfig, MistralConfig as MistralConfig, MixtralConfig as MixtralConfig, MPTConfig as MPTConfig, OPTConfig as OPTConfig, PhiConfig as PhiConfig, QwenConfig as QwenConfig, StableLMConfig as StableLMConfig, StarCoderConfig as StarCoderConfig, YiConfig as YiConfig
from openllm_core.config import CONFIG_MAPPING as CONFIG_MAPPING, CONFIG_MAPPING_NAMES as CONFIG_MAPPING_NAMES, AutoConfig as AutoConfig, BaichuanConfig as BaichuanConfig, ChatGLMConfig as ChatGLMConfig, CohereConfig as CohereConfig, DbrxConfig as DbrxConfig, DollyV2Config as DollyV2Config, FalconConfig as FalconConfig, GemmaConfig as GemmaConfig, GPTNeoXConfig as GPTNeoXConfig, LlamaConfig as LlamaConfig, MistralConfig as MistralConfig, MixtralConfig as MixtralConfig, MPTConfig as MPTConfig, OPTConfig as OPTConfig, PhiConfig as PhiConfig, QwenConfig as QwenConfig, StableLMConfig as StableLMConfig, StarCoderConfig as StarCoderConfig, YiConfig as YiConfig
from openllm_core._configuration import GenerationConfig as GenerationConfig, LLMConfig as LLMConfig
from openllm_core._schemas import GenerationInput as GenerationInput, GenerationOutput as GenerationOutput, MetadataOutput as MetadataOutput, MessageParam as MessageParam
from openllm_core.utils import api as api


@@ -6,7 +6,6 @@ from openllm_core.utils import (
DEV_DEBUG_VAR as DEV_DEBUG_VAR,
ENV_VARS_TRUE_VALUES as ENV_VARS_TRUE_VALUES,
MYPY as MYPY,
OPTIONAL_DEPENDENCIES as OPTIONAL_DEPENDENCIES,
QUIET_ENV_VAR as QUIET_ENV_VAR,
SHOW_CODEGEN as SHOW_CODEGEN,
LazyLoader as LazyLoader,