diff --git a/ADDING_NEW_MODEL.md b/ADDING_NEW_MODEL.md
index 0e93aefe..15996e19 100644
--- a/ADDING_NEW_MODEL.md
+++ b/ADDING_NEW_MODEL.md
@@ -15,7 +15,7 @@ Here's your roadmap:
`$GIT_ROOT/openllm-core/src/openllm_core/config/configuration_{model_name}.py`
- [ ] Update `$GIT_ROOT/openllm-core/src/openllm_core/config/__init__.py` to import the new model
- [ ] Add your new model entry in `$GIT_ROOT/openllm-core/src/openllm_core/config/configuration_auto.py` as a tuple of the `model_name` and its `ModelConfig`
-- [ ] Run `./tools/update-config-stubs.py`
+- [ ] Run `./tools/update-config-stubs.py` and `./tools/update-readme.py`
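The `configuration_auto.py` step above can be sketched as follows. The names here are hypothetical; mirror the existing tuples in `CONFIG_MAPPING_NAMES`:

```python
from collections import OrderedDict

# Hypothetical excerpt; the real mapping lives in
# openllm-core/src/openllm_core/config/configuration_auto.py.
CONFIG_MAPPING_NAMES = OrderedDict(
    [
        ("baichuan", "BaichuanConfig"),  # existing entry (illustrative)
        ("my_model", "MyModelConfig"),   # your new (model_name, ModelConfig) tuple
    ]
)
```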
> [!NOTE]
>
@@ -40,8 +40,3 @@ the `__init__` file, and the tuple under `CONFIG_MAPPING_NAMES` in [openllm-core
Once you have completed the checklist above, raise a PR and the OpenLLMs
maintainer will review it ASAP. Once the PR is merged, you should be able to see
your model in the next release! 🎉 🎊
-
-### Updating README.md
-
-After a model is added, just ping OpenLLM's maintainers to update the README.md file
-with the new model.
diff --git a/README.md b/README.md
index abf7d255..36de9bce 100644
--- a/README.md
+++ b/README.md
@@ -165,86 +165,415 @@ openllm start facebook/opt-2.7b
OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.
+
-Mistral
+
+Baichuan
+
### Quickstart
-Run the following commands to quickly spin up a Llama 2 server and send a request to it.
+Run the following command to quickly spin up a Baichuan server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start baichuan-inc/baichuan-7b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start HuggingFaceH4/zephyr-7b-beta
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
-> [!NOTE]
-> Note that any Mistral variants can be deployed with OpenLLM.
-> Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=mistral) to see more Mistral compatible models.
+
+> **Note:** Any Baichuan variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=baichuan) to see more Baichuan-compatible models.
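Besides `openllm query`, the running server can also be reached over plain HTTP. Below is a minimal sketch using only the Python standard library; it assumes the server exposes a `/v1/generate` route accepting a JSON body with a `prompt` field, so verify the exact route and schema for your OpenLLM version:

```python
import json
from urllib.request import Request, urlopen

def build_generate_request(endpoint: str, prompt: str) -> Request:
    """Build a POST request for the (assumed) /v1/generate route."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return Request(
        f"{endpoint}/v1/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("http://localhost:3000", "What are large language models?")

# With the server from the quickstart running:
# with urlopen(req) as resp:
#     print(json.load(resp))
```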
+
+
### Supported models
-You can specify any of the following Mistral models by using `--model-id`.
+You can specify any of the following Baichuan models via `openllm start`:
-- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
-- [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
-- [amazon/MistralLite](https://huggingface.co/amazon/MistralLite)
-- [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
-- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
-- Any other models that strictly follows the [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM) architecture
+- [baichuan-inc/baichuan-7b](https://huggingface.co/baichuan-inc/baichuan-7b)
+- [baichuan-inc/baichuan-13b-base](https://huggingface.co/baichuan-inc/baichuan-13b-base)
+- [baichuan-inc/baichuan-13b-chat](https://huggingface.co/baichuan-inc/baichuan-13b-chat)
+- [fireballoon/baichuan-vicuna-chinese-7b](https://huggingface.co/fireballoon/baichuan-vicuna-chinese-7b)
+- [fireballoon/baichuan-vicuna-7b](https://huggingface.co/fireballoon/baichuan-vicuna-7b)
+- [hiyouga/baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft)
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
- ```bash
- openllm start HuggingFaceH4/zephyr-7b-beta --backend pt
- ```
- vLLM (Recommended):
- ```bash
- pip install "openllm[vllm]"
- openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm
- ```
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start baichuan-inc/baichuan-7b --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start baichuan-inc/baichuan-7b --backend pt
+```
-Llama
-### Installation
+ChatGLM
-To run Llama models with OpenLLM, you need to install the `llama` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[llama]"
-```
### Quickstart
-Run the following commands to quickly spin up a Llama 2 server and send a request to it.
+Run the following command to quickly spin up a ChatGLM server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start thudm/chatglm-6b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start meta-llama/Llama-2-7b-chat-hf
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
-> [!NOTE]
-> To use the official Llama 2 models, you must gain access by visiting
-> the [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and
-> accepting its license terms and acceptable use policy. You also need to obtain access to these
-> models on [Hugging Face](https://huggingface.co/meta-llama). Note that any Llama 2 variants can
-> be deployed with OpenLLM if you don’t have access to the official Llama 2 model.
-> Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=llama2) to see more Llama 2 compatible models.
+
+> **Note:** Any ChatGLM variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=chatglm) to see more ChatGLM-compatible models.
+
+
### Supported models
-You can specify any of the following Llama models by using `--model-id`.
+You can specify any of the following ChatGLM models via `openllm start`:
+
+
+- [thudm/chatglm-6b](https://huggingface.co/thudm/chatglm-6b)
+- [thudm/chatglm-6b-int8](https://huggingface.co/thudm/chatglm-6b-int8)
+- [thudm/chatglm-6b-int4](https://huggingface.co/thudm/chatglm-6b-int4)
+- [thudm/chatglm2-6b](https://huggingface.co/thudm/chatglm2-6b)
+- [thudm/chatglm2-6b-int4](https://huggingface.co/thudm/chatglm2-6b-int4)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- PyTorch:
+
+
+```bash
+openllm start thudm/chatglm-6b --backend pt
+```
+
+
+
+
+
+DollyV2
+
+
+### Quickstart
+
+Run the following command to quickly spin up a DollyV2 server:
+
+```bash
+openllm start databricks/dolly-v2-3b
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any DollyV2 variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=dolly_v2) to see more DollyV2-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following DollyV2 models via `openllm start`:
+
+
+- [databricks/dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b)
+- [databricks/dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b)
+- [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start databricks/dolly-v2-3b --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start databricks/dolly-v2-3b --backend pt
+```
+
+
+
+
+
+Falcon
+
+
+### Quickstart
+
+Run the following command to quickly spin up a Falcon server:
+
+```bash
+openllm start tiiuae/falcon-7b
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any Falcon variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=falcon) to see more Falcon-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following Falcon models via `openllm start`:
+
+
+- [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
+- [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
+- [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
+- [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start tiiuae/falcon-7b --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start tiiuae/falcon-7b --backend pt
+```
+
+
+
+
+
+FlanT5
+
+
+### Quickstart
+
+Run the following command to quickly spin up a FlanT5 server:
+
+```bash
+openllm start google/flan-t5-large
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any FlanT5 variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=flan_t5) to see more FlanT5-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following FlanT5 models via `openllm start`:
+
+
+- [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
+- [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)
+- [google/flan-t5-large](https://huggingface.co/google/flan-t5-large)
+- [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl)
+- [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- PyTorch:
+
+
+```bash
+openllm start google/flan-t5-small --backend pt
+```
+
+
+
+
+
+GPTNeoX
+
+
+### Quickstart
+
+Run the following command to quickly spin up a GPTNeoX server:
+
+```bash
+openllm start eleutherai/gpt-neox-20b
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any GPTNeoX variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=gpt_neox) to see more GPTNeoX-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following GPTNeoX models via `openllm start`:
+
+
+- [eleutherai/gpt-neox-20b](https://huggingface.co/eleutherai/gpt-neox-20b)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start eleutherai/gpt-neox-20b --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start eleutherai/gpt-neox-20b --backend pt
+```
+
+
+
+
+
+Llama
+
+
+### Quickstart
+
+Run the following command to quickly spin up a Llama server:
+
+```bash
+openllm start NousResearch/llama-2-7b-hf
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any Llama variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=llama) to see more Llama-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following Llama models via `openllm start`:
+
- [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
- [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
@@ -258,293 +587,139 @@ You can specify any of the following Llama models by using `--model-id`.
- [NousResearch/llama-2-70b-hf](https://huggingface.co/NousResearch/llama-2-70b-hf)
- [NousResearch/llama-2-13b-hf](https://huggingface.co/NousResearch/llama-2-13b-hf)
- [NousResearch/llama-2-7b-hf](https://huggingface.co/NousResearch/llama-2-7b-hf)
-- [openlm-research/open_llama_7b_v2](https://huggingface.co/openlm-research/open_llama_7b_v2)
-- [openlm-research/open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2)
-- [openlm-research/open_llama_13b](https://huggingface.co/openlm-research/open_llama_13b)
-- [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
-- [huggyllama/llama-30b](https://huggingface.co/huggyllama/llama-30b)
-- [huggyllama/llama-13b](https://huggingface.co/huggyllama/llama-13b)
-- [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b)
-- Any other models that strictly follows the [LlamaForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/llama#transformers.LlamaForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
- ```bash
- openllm start meta-llama/Llama-2-7b-chat-hf --backend pt
- ```
- vLLM (Recommended):
- ```bash
- pip install "openllm[llama, vllm]"
- openllm start meta-llama/Llama-2-7b-chat-hf --backend vllm
- ```
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start meta-llama/Llama-2-70b-chat-hf --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start meta-llama/Llama-2-70b-chat-hf --backend pt
+```
-ChatGLM
-### Installation
+Mistral
-To run ChatGLM models with OpenLLM, you need to install the `chatglm` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[chatglm]"
-```
### Quickstart
-Run the following commands to quickly spin up a ChatGLM server and send a request to it.
+Run the following command to quickly spin up a Mistral server:
+
+```bash
+openllm start mistralai/Mistral-7B-Instruct-v0.1
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start thudm/chatglm2-6b
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any Mistral variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=mistral) to see more Mistral-compatible models.
+
+
+
### Supported models
-You can specify any of the following ChatGLM models by using `--model-id`.
+You can specify any of the following Mistral models via `openllm start`:
-- [thudm/chatglm-6b](https://huggingface.co/thudm/chatglm-6b)
-- [thudm/chatglm-6b-int8](https://huggingface.co/thudm/chatglm-6b-int8)
-- [thudm/chatglm-6b-int4](https://huggingface.co/thudm/chatglm-6b-int4)
-- [thudm/chatglm2-6b](https://huggingface.co/thudm/chatglm2-6b)
-- [thudm/chatglm2-6b-int4](https://huggingface.co/thudm/chatglm2-6b-int4)
-- Any other models that strictly follows the [ChatGLMForConditionalGeneration](https://github.com/THUDM/ChatGLM-6B) architecture
+- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
+- [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
+- [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
+- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
- ```bash
- openllm start thudm/chatglm2-6b --backend pt
- ```
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start HuggingFaceH4/zephyr-7b-alpha --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start HuggingFaceH4/zephyr-7b-alpha --backend pt
+```
-Dolly-v2
-### Installation
-
-Dolly-v2 models do not require you to install any model-specific dependencies once you have `openllm` installed.
-
-```bash
-pip install openllm
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a Dolly-v2 server and send a request to it.
-
-```bash
-openllm start databricks/dolly-v2-3b
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following Dolly-v2 models by using `--model-id`.
-
-- [databricks/dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b)
-- [databricks/dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b)
-- [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
-- Any other models that strictly follows the [GPTNeoXForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_neox#transformers.GPTNeoXForCausalLM) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start databricks/dolly-v2-3b --backend pt
- ```
-
-- vLLM:
-
- ```bash
- openllm start databricks/dolly-v2-3b --backend vllm
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
-Falcon
-
-### Installation
-
-To run Falcon models with OpenLLM, you need to install the `falcon` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[falcon]"
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a Falcon server and send a request to it.
-
-```bash
-openllm start tiiuae/falcon-7b
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following Falcon models by using `--model-id`.
-
-- [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
-- [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
-- [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
-- [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct)
-- Any other models that strictly follows the [FalconForCausalLM](https://falconllm.tii.ae/) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start tiiuae/falcon-7b --backend pt
- ```
-
-- vLLM:
-
- ```bash
- pip install "openllm[falcon, vllm]"
- openllm start tiiuae/falcon-7b --backend vllm
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
-Flan-T5
-
-### Installation
-
-To run Flan-T5 models with OpenLLM, you need to install the `flan-t5` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[flan-t5]"
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a Flan-T5 server and send a request to it.
-
-```bash
-openllm start google/flan-t5-large
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following Flan-T5 models by using `--model-id`.
-
-- [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
-- [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)
-- [google/flan-t5-large](https://huggingface.co/google/flan-t5-large)
-- [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl)
-- [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl)
-- Any other models that strictly follows the [T5ForConditionalGeneration](https://huggingface.co/docs/transformers/main/model_doc/t5#transformers.T5ForConditionalGeneration) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start google/flan-t5-large --backend pt
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
-GPT-NeoX
-
-### Installation
-
-GPT-NeoX models do not require you to install any model-specific dependencies once you have `openllm` installed.
-
-```bash
-pip install openllm
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a GPT-NeoX server and send a request to it.
-
-```bash
-openllm start eleutherai/gpt-neox-20b
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following GPT-NeoX models by using `--model-id`.
-
-- [eleutherai/gpt-neox-20b](https://huggingface.co/eleutherai/gpt-neox-20b)
-- Any other models that strictly follows the [GPTNeoXForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_neox#transformers.GPTNeoXForCausalLM) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start eleutherai/gpt-neox-20b --backend pt
- ```
-
-- vLLM:
-
- ```bash
- openllm start eleutherai/gpt-neox-20b --backend vllm
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
MPT
-### Installation
-
-To run MPT models with OpenLLM, you need to install the `mpt` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[mpt]"
-```
### Quickstart
-Run the following commands to quickly spin up a MPT server and send a request to it.
+Run the following command to quickly spin up an MPT server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start mosaicml/mpt-7b-instruct
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start mosaicml/mpt-7b-chat
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any MPT variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=mpt) to see more MPT-compatible models.
+
+
+
### Supported models
-You can specify any of the following MPT models by using `--model-id`.
+You can specify any of the following MPT models via `openllm start`:
+
- [mosaicml/mpt-7b](https://huggingface.co/mosaicml/mpt-7b)
- [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct)
@@ -553,52 +728,70 @@ You can specify any of the following MPT models by using `--model-id`.
- [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b)
- [mosaicml/mpt-30b-instruct](https://huggingface.co/mosaicml/mpt-30b-instruct)
- [mosaicml/mpt-30b-chat](https://huggingface.co/mosaicml/mpt-30b-chat)
-- Any other models that strictly follows the [MPTForCausalLM](https://huggingface.co/mosaicml) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
- ```bash
- openllm start mosaicml/mpt-7b-chat --backend pt
- ```
- vLLM (Recommended):
- ```bash
- pip install "openllm[mpt, vllm]"
- openllm start mosaicml/mpt-7b-chat --backend vllm
- ```
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start mosaicml/mpt-7b --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start mosaicml/mpt-7b --backend pt
+```
+
OPT
-### Installation
-
-To run OPT models with OpenLLM, you need to install the `opt` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[opt]"
-```
### Quickstart
-Run the following commands to quickly spin up an OPT server and send a request to it.
+Run the following command to quickly spin up an OPT server:
+
+```bash
+openllm start facebook/opt-1.3b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start facebook/opt-2.7b
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any OPT variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=opt) to see more OPT-compatible models.
+
+
+
### Supported models
-You can specify any of the following OPT models by using `--model-id`.
+You can specify any of the following OPT models via `openllm start`:
+
- [facebook/opt-125m](https://huggingface.co/facebook/opt-125m)
- [facebook/opt-350m](https://huggingface.co/facebook/opt-350m)
@@ -606,180 +799,249 @@ You can specify any of the following OPT models by using `--model-id`.
- [facebook/opt-2.7b](https://huggingface.co/facebook/opt-2.7b)
- [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b)
- [facebook/opt-66b](https://huggingface.co/facebook/opt-66b)
-- Any other models that strictly follows the [OPTForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/opt#transformers.OPTForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
- ```bash
- openllm start facebook/opt-2.7b --backend pt
- ```
-- vLLM:
- ```bash
- pip install "openllm[opt, vllm]"
- openllm start facebook/opt-2.7b --backend vllm
- ```
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`:
+
+```bash
+openllm start facebook/opt-125m --backend vllm
+```
+
+
+> **Important:** vLLM performs best on GPUs with compute capability 8.0 or newer. We recommend using vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start facebook/opt-125m --backend pt
+```
+
StableLM
-### Installation
-
-StableLM models do not require you to install any model-specific dependencies once you have `openllm` installed.
-
-```bash
-pip install openllm
-```
### Quickstart
-Run the following commands to quickly spin up a StableLM server and send a request to it.
+Run the following command to quickly spin up a StableLM server:
+
+```bash
+openllm start stabilityai/stablelm-tuned-alpha-3b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start stabilityai/stablelm-tuned-alpha-7b
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any StableLM variants can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=stablelm) to see more StableLM-compatible models.
+
+
+
### Supported models
-You can specify any of the following StableLM models by using `--model-id`.
+You can specify any of the following StableLM models via `openllm start`:
+
- [stabilityai/stablelm-tuned-alpha-3b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b)
- [stabilityai/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
- [stabilityai/stablelm-base-alpha-3b](https://huggingface.co/stabilityai/stablelm-base-alpha-3b)
- [stabilityai/stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b)
-- Any other models that strictly follows the [GPTNeoXForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_neox#transformers.GPTNeoXForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available, and falls back to PyTorch otherwise.
- ```bash
- openllm start stabilityai/stablelm-tuned-alpha-7b --backend pt
- ```
-- vLLM:
- ```bash
- openllm start stabilityai/stablelm-tuned-alpha-7b --backend vllm
- ```
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start stabilityai/stablelm-tuned-alpha-3b --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start stabilityai/stablelm-tuned-alpha-3b --backend pt
+```
+
StarCoder
-### Installation
-
-To run StarCoder models with OpenLLM, you need to install the `starcoder` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[starcoder]"
-```
### Quickstart
-Run the following commands to quickly spin up a StarCoder server and send a request to it.
+Run the following command to quickly spin up a StarCoder server:
```bash
openllm start bigcode/starcoder
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any StarCoder variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=starcoder) to see more StarCoder-compatible models.
+
+
+
### Supported models
-You can specify any of the following StarCoder models by using `--model-id`.
+You can specify any of the following StarCoder models via `openllm start`:
+
- [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
- [bigcode/starcoderbase](https://huggingface.co/bigcode/starcoderbase)
-- Any other models that strictly follows the [GPTBigCodeForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_bigcode#transformers.GPTBigCodeForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
- ```bash
- openllm start bigcode/starcoder --backend pt
- ```
-- vLLM:
- ```bash
- pip install "openllm[startcoder, vllm]"
- openllm start bigcode/starcoder --backend vllm
- ```
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start bigcode/starcoder --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start bigcode/starcoder --backend pt
+```
-Baichuan
-### Installation
+Yi
-To run Baichuan models with OpenLLM, you need to install the `baichuan` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[baichuan]"
-```
### Quickstart
-Run the following commands to quickly spin up a Baichuan server and send a request to it.
+Run the following command to quickly spin up a Yi server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start 01-ai/Yi-6B
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start baichuan-inc/baichuan-13b-base
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any Yi variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=yi) to see more Yi-compatible models.
+
+
+
### Supported models
-You can specify any of the following Baichuan models by using `--model-id`.
+You can specify any of the following Yi models via `openllm start`:
-- [baichuan-inc/baichuan-7b](https://huggingface.co/baichuan-inc/baichuan-7b)
-- [baichuan-inc/baichuan-13b-base](https://huggingface.co/baichuan-inc/baichuan-13b-base)
-- [baichuan-inc/baichuan-13b-chat](https://huggingface.co/baichuan-inc/baichuan-13b-chat)
-- [fireballoon/baichuan-vicuna-chinese-7b](https://huggingface.co/fireballoon/baichuan-vicuna-chinese-7b)
-- [fireballoon/baichuan-vicuna-7b](https://huggingface.co/fireballoon/baichuan-vicuna-7b)
-- [hiyouga/baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft)
-- Any other models that strictly follows the [BaiChuanForCausalLM](https://github.com/baichuan-inc/Baichuan-7B) architecture
+- [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B)
+- [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)
+- [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K)
+- [01-ai/Yi-34B-200K](https://huggingface.co/01-ai/Yi-34B-200K)
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
- ```bash
- openllm start baichuan-inc/baichuan-13b-base --backend pt
- ```
-- vLLM:
- ```bash
- pip install "openllm[baichuan, vllm]"
- openllm start baichuan-inc/baichuan-13b-base --backend vllm
- ```
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start 01-ai/Yi-6B --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start 01-ai/Yi-6B --backend pt
+```
+
+
More models will be integrated with OpenLLM and we welcome your contributions if you want to incorporate your custom LLMs into the ecosystem. Check out [Adding a New Model Guide](https://github.com/bentoml/OpenLLM/blob/main/ADDING_NEW_MODEL.md) to learn more.
## 💻 Run your model on multiple GPUs
diff --git a/cz.py b/cz.py
index 2ce8788a..6bda9afc 100755
--- a/cz.py
+++ b/cz.py
@@ -1,5 +1,4 @@
#!/usr/bin/env python3
-from __future__ import annotations
import itertools
import os
import token
@@ -9,13 +8,15 @@ from tabulate import tabulate
TOKEN_WHITELIST = [token.OP, token.NAME, token.NUMBER, token.STRING]
+_ignored = ['_version.py']
-def run_cz(dir: str, package: str):
+
+def run_cz(dir, package):
headers = ['Name', 'Lines', 'Tokens/Line']
table = []
for path, _, files in os.walk(os.path.join(dir, 'src', package)):
for name in files:
- if not name.endswith('.py'):
+ if not name.endswith('.py') or name in _ignored:
continue
filepath = os.path.join(path, name)
with tokenize.open(filepath) as file_:
@@ -28,7 +29,6 @@ def run_cz(dir: str, package: str):
token_count / line_count if line_count != 0 else 0,
]
)
- print(f'\n{"=" * 80}\n')
print(tabulate([headers, *sorted(table, key=lambda x: -x[1])], headers='firstrow', floatfmt='.1f') + '\n')
print(
tabulate(
@@ -43,15 +43,8 @@ def run_cz(dir: str, package: str):
)
)
print(f'total line count for {package}: {sum([x[1] for x in table])}\n')
-
-
-def main() -> int:
- run_cz('openllm-python', 'openllm')
- run_cz('openllm-python', 'openllm_cli')
- run_cz('openllm-core', 'openllm_core')
- run_cz('openllm-client', 'openllm_client')
return 0
if __name__ == '__main__':
- raise SystemExit(main())
+  raise SystemExit(run_cz('openllm-python', 'openllm'))
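The Tokens/Line figure that `run_cz` reports comes from counting only a whitelist of token types against the file's line count. The same computation can be reproduced on a plain source string; this is a standalone sketch of the metric, not the script's exact file-walking path:

```python
import io
import token
import tokenize

# Same whitelist as cz.py: structural tokens only, no comments or whitespace.
TOKEN_WHITELIST = [token.OP, token.NAME, token.NUMBER, token.STRING]

def tokens_per_line(source):
    # Tokenize the string and keep only whitelisted token types,
    # mirroring the filter run_cz applies per file.
    kept = [
        t for t in tokenize.generate_tokens(io.StringIO(source).readline)
        if t.type in TOKEN_WHITELIST
    ]
    line_count = len(source.splitlines())
    ratio = len(kept) / line_count if line_count else 0
    return len(kept), line_count, ratio

print(tokens_per_line('x = 1\ny = x + 2\n'))
```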
diff --git a/openllm-core/src/openllm_core/config/configuration_auto.py b/openllm-core/src/openllm_core/config/configuration_auto.py
index 66db071f..2c38a3f1 100644
--- a/openllm-core/src/openllm_core/config/configuration_auto.py
+++ b/openllm-core/src/openllm_core/config/configuration_auto.py
@@ -28,21 +28,23 @@ else:
# NOTE: This is the entrypoint when adding new model config
CONFIG_MAPPING_NAMES = OrderedDict(
- [
- ('chatglm', 'ChatGLMConfig'),
- ('dolly_v2', 'DollyV2Config'),
- ('falcon', 'FalconConfig'),
- ('flan_t5', 'FlanT5Config'),
- ('gpt_neox', 'GPTNeoXConfig'),
- ('llama', 'LlamaConfig'),
- ('mpt', 'MPTConfig'),
- ('opt', 'OPTConfig'),
- ('stablelm', 'StableLMConfig'),
- ('starcoder', 'StarCoderConfig'),
- ('mistral', 'MistralConfig'),
- ('yi', 'YiConfig'),
- ('baichuan', 'BaichuanConfig'),
- ]
+ sorted(
+ [
+ ('chatglm', 'ChatGLMConfig'),
+ ('dolly_v2', 'DollyV2Config'),
+ ('falcon', 'FalconConfig'),
+ ('flan_t5', 'FlanT5Config'),
+ ('gpt_neox', 'GPTNeoXConfig'),
+ ('llama', 'LlamaConfig'),
+ ('mpt', 'MPTConfig'),
+ ('opt', 'OPTConfig'),
+ ('stablelm', 'StableLMConfig'),
+ ('starcoder', 'StarCoderConfig'),
+ ('mistral', 'MistralConfig'),
+ ('yi', 'YiConfig'),
+ ('baichuan', 'BaichuanConfig'),
+ ]
+ )
)
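Wrapping the entry list in `sorted(...)` keeps `CONFIG_MAPPING_NAMES` in alphabetical order by model name regardless of the order in which entries are appended. A quick illustration of the effect, using a shortened hypothetical list:

```python
from collections import OrderedDict

# Entries appended in arbitrary order, as in CONFIG_MAPPING_NAMES.
entries = [
    ('mistral', 'MistralConfig'),
    ('yi', 'YiConfig'),
    ('baichuan', 'BaichuanConfig'),
]

# sorted() orders the (name, config) tuples by name before the
# OrderedDict fixes the iteration order.
mapping = OrderedDict(sorted(entries))
print(list(mapping))  # keys iterate alphabetically
```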
diff --git a/openllm-core/src/openllm_core/config/configuration_mistral.py b/openllm-core/src/openllm_core/config/configuration_mistral.py
index 3fb29e24..94ce4e5d 100644
--- a/openllm-core/src/openllm_core/config/configuration_mistral.py
+++ b/openllm-core/src/openllm_core/config/configuration_mistral.py
@@ -34,7 +34,12 @@ class MistralConfig(openllm_core.LLMConfig):
'serialisation': 'safetensors',
# NOTE: see https://docs.mistral.ai/usage/guardrailing/
# and https://docs.mistral.ai/llm/mistral-instruct-v0.1
- 'model_ids': ['mistralai/Mistral-7B-Instruct-v0.1', 'mistralai/Mistral-7B-v0.1'],
+ 'model_ids': [
+ 'HuggingFaceH4/zephyr-7b-alpha',
+ 'HuggingFaceH4/zephyr-7b-beta',
+ 'mistralai/Mistral-7B-Instruct-v0.1',
+ 'mistralai/Mistral-7B-v0.1',
+ ],
'fine_tune_strategies': (
{'adapter_type': 'lora', 'r': 64, 'lora_alpha': 16, 'lora_dropout': 0.1, 'bias': 'none'},
),
diff --git a/openllm-python/README.md b/openllm-python/README.md
index 874aaee3..36de9bce 100644
--- a/openllm-python/README.md
+++ b/openllm-python/README.md
@@ -45,7 +45,7 @@ Key features include:
🔥 **Flexible APIs**: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice.
-⛓️ **Freedom to build**: First-class support for LangChain, BentoML and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.
+⛓️ **Freedom to build**: First-class support for LangChain, BentoML, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.
🎯 **Streamline deployment**: Automatically generate your LLM server Docker images or deploy as serverless endpoints via
[☁️ BentoCloud](https://l.bentoml.com/bento-cloud), which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness.
@@ -165,86 +165,415 @@ openllm start facebook/opt-2.7b
OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.
+
-Mistral
+
+Baichuan
+
### Quickstart
-Run the following commands to quickly spin up a Llama 2 server and send a request to it.
+Run the following command to quickly spin up a Baichuan server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start baichuan-inc/baichuan-7b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start HuggingFaceH4/zephyr-7b-beta
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
-> [!NOTE]
-> Note that any Mistral variants can be deployed with OpenLLM.
-> Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=mistral) to see more Mistral compatible models.
+
+> **Note:** Any Baichuan variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=baichuan) to see more Baichuan-compatible models.
+
+
### Supported models
-You can specify any of the following Mistral models by using `--model-id`.
+You can specify any of the following Baichuan models via `openllm start`:
-- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
-- [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
-- [amazon/MistralLite](https://huggingface.co/amazon/MistralLite)
-- [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
-- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
-- Any other models that strictly follows the [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM) architecture
+- [baichuan-inc/baichuan-7b](https://huggingface.co/baichuan-inc/baichuan-7b)
+- [baichuan-inc/baichuan-13b-base](https://huggingface.co/baichuan-inc/baichuan-13b-base)
+- [baichuan-inc/baichuan-13b-chat](https://huggingface.co/baichuan-inc/baichuan-13b-chat)
+- [fireballoon/baichuan-vicuna-chinese-7b](https://huggingface.co/fireballoon/baichuan-vicuna-chinese-7b)
+- [fireballoon/baichuan-vicuna-7b](https://huggingface.co/fireballoon/baichuan-vicuna-7b)
+- [hiyouga/baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft)
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
- ```bash
- openllm start HuggingFaceH4/zephyr-7b-beta --backend pt
- ```
- vLLM (Recommended):
- ```bash
- pip install "openllm[vllm]"
- openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm
- ```
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start baichuan-inc/baichuan-7b --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start baichuan-inc/baichuan-7b --backend pt
+```
-Llama
-### Installation
+ChatGLM
-To run Llama models with OpenLLM, you need to install the `llama` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[llama]"
-```
### Quickstart
-Run the following commands to quickly spin up a Llama 2 server and send a request to it.
+Run the following command to quickly spin up a ChatGLM server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start thudm/chatglm-6b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start meta-llama/Llama-2-7b-chat-hf
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
-> [!NOTE]
-> To use the official Llama 2 models, you must gain access by visiting
-> the [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and
-> accepting its license terms and acceptable use policy. You also need to obtain access to these
-> models on [Hugging Face](https://huggingface.co/meta-llama). Note that any Llama 2 variants can
-> be deployed with OpenLLM if you don’t have access to the official Llama 2 model.
-> Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=llama2) to see more Llama 2 compatible models.
+
+> **Note:** Any ChatGLM variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=chatglm) to see more ChatGLM-compatible models.
+
+
### Supported models
-You can specify any of the following Llama models by using `--model-id`.
+You can specify any of the following ChatGLM models via `openllm start`:
+
+
+- [thudm/chatglm-6b](https://huggingface.co/thudm/chatglm-6b)
+- [thudm/chatglm-6b-int8](https://huggingface.co/thudm/chatglm-6b-int8)
+- [thudm/chatglm-6b-int4](https://huggingface.co/thudm/chatglm-6b-int4)
+- [thudm/chatglm2-6b](https://huggingface.co/thudm/chatglm2-6b)
+- [thudm/chatglm2-6b-int4](https://huggingface.co/thudm/chatglm2-6b-int4)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- PyTorch:
+
+
+```bash
+openllm start thudm/chatglm-6b --backend pt
+```
+
+
+
+
+
+DollyV2
+
+
+### Quickstart
+
+Run the following command to quickly spin up a DollyV2 server:
+
+```bash
+openllm start databricks/dolly-v2-3b
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any DollyV2 variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=dolly_v2) to see more DollyV2-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following DollyV2 models via `openllm start`:
+
+
+- [databricks/dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b)
+- [databricks/dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b)
+- [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start databricks/dolly-v2-3b --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start databricks/dolly-v2-3b --backend pt
+```
+
+
+
+
+
+Falcon
+
+
+### Quickstart
+
+Run the following command to quickly spin up a Falcon server:
+
+```bash
+openllm start tiiuae/falcon-7b
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any Falcon variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=falcon) to see more Falcon-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following Falcon models via `openllm start`:
+
+
+- [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
+- [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
+- [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
+- [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start tiiuae/falcon-7b --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start tiiuae/falcon-7b --backend pt
+```
+
+
+
+
+
+FlanT5
+
+
+### Quickstart
+
+Run the following command to quickly spin up a FlanT5 server:
+
+```bash
+openllm start google/flan-t5-large
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any FlanT5 variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=flan_t5) to see more FlanT5-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following FlanT5 models via `openllm start`:
+
+
+- [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
+- [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)
+- [google/flan-t5-large](https://huggingface.co/google/flan-t5-large)
+- [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl)
+- [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- PyTorch:
+
+
+```bash
+openllm start google/flan-t5-small --backend pt
+```
+
+
+
+
+
+GPTNeoX
+
+
+### Quickstart
+
+Run the following command to quickly spin up a GPTNeoX server:
+
+```bash
+openllm start eleutherai/gpt-neox-20b
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any GPTNeoX variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=gpt_neox) to see more GPTNeoX-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following GPTNeoX models via `openllm start`:
+
+
+- [eleutherai/gpt-neox-20b](https://huggingface.co/eleutherai/gpt-neox-20b)
+### Supported backends
+
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start eleutherai/gpt-neox-20b --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start eleutherai/gpt-neox-20b --backend pt
+```
+
+
+
+
+
+Llama
+
+
+### Quickstart
+
+Run the following command to quickly spin up a Llama server:
+
+```bash
+openllm start NousResearch/llama-2-7b-hf
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```
+
+
+> **Note:** Any Llama variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=llama) to see more Llama-compatible models.
+
+
+
+### Supported models
+
+You can specify any of the following Llama models via `openllm start`:
+
- [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
- [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
@@ -258,293 +587,139 @@ You can specify any of the following Llama models by using `--model-id`.
- [NousResearch/llama-2-70b-hf](https://huggingface.co/NousResearch/llama-2-70b-hf)
- [NousResearch/llama-2-13b-hf](https://huggingface.co/NousResearch/llama-2-13b-hf)
- [NousResearch/llama-2-7b-hf](https://huggingface.co/NousResearch/llama-2-7b-hf)
-- [openlm-research/open_llama_7b_v2](https://huggingface.co/openlm-research/open_llama_7b_v2)
-- [openlm-research/open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2)
-- [openlm-research/open_llama_13b](https://huggingface.co/openlm-research/open_llama_13b)
-- [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
-- [huggyllama/llama-30b](https://huggingface.co/huggyllama/llama-30b)
-- [huggyllama/llama-13b](https://huggingface.co/huggyllama/llama-13b)
-- [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b)
-- Any other models that strictly follows the [LlamaForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/llama#transformers.LlamaForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
- ```bash
- openllm start meta-llama/Llama-2-7b-chat-hf --backend pt
- ```
- vLLM (Recommended):
- ```bash
- pip install "openllm[llama, vllm]"
- openllm start meta-llama/Llama-2-7b-chat-hf --backend vllm
- ```
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start meta-llama/Llama-2-70b-chat-hf --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start meta-llama/Llama-2-70b-chat-hf --backend pt
+```
-ChatGLM
-### Installation
+Mistral
-To run ChatGLM models with OpenLLM, you need to install the `chatglm` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[chatglm]"
-```
### Quickstart
-Run the following commands to quickly spin up a ChatGLM server and send a request to it.
+Run the following command to quickly spin up a Mistral server:
+
+```bash
+openllm start mistralai/Mistral-7B-Instruct-v0.1
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start thudm/chatglm2-6b
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any Mistral variant can be deployed with OpenLLM. Visit the [Hugging Face Model Hub](https://huggingface.co/models?sort=trending&search=mistral) to see more Mistral-compatible models.
+
+
+
### Supported models
-You can specify any of the following ChatGLM models by using `--model-id`.
+You can specify any of the following Mistral models via `openllm start`:
-- [thudm/chatglm-6b](https://huggingface.co/thudm/chatglm-6b)
-- [thudm/chatglm-6b-int8](https://huggingface.co/thudm/chatglm-6b-int8)
-- [thudm/chatglm-6b-int4](https://huggingface.co/thudm/chatglm-6b-int4)
-- [thudm/chatglm2-6b](https://huggingface.co/thudm/chatglm2-6b)
-- [thudm/chatglm2-6b-int4](https://huggingface.co/thudm/chatglm2-6b-int4)
-- Any other models that strictly follows the [ChatGLMForConditionalGeneration](https://github.com/THUDM/ChatGLM-6B) architecture
+- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
+- [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
+- [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
+- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. By default, OpenLLM uses vLLM if it is available and falls back to PyTorch otherwise.
- ```bash
- openllm start thudm/chatglm2-6b --backend pt
- ```
+
+
+> **Important:** We recommend users explicitly specify `--backend` to choose the desired backend for running the model. If you have access to a GPU, always use `--backend vllm`.
+
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`.
+
+```bash
+openllm start HuggingFaceH4/zephyr-7b-alpha --backend vllm
+```
+
+
+> **Important:** vLLM requires a GPU with compute capability 8.0 or newer to achieve the best serving performance. We recommend vLLM for all production serving use cases.
+
+
+
+> **Note:** Adapters are not yet supported with the vLLM backend.
+
+
+- PyTorch:
+
+
+```bash
+openllm start HuggingFaceH4/zephyr-7b-alpha --backend pt
+```
-Dolly-v2
-### Installation
-
-Dolly-v2 models do not require you to install any model-specific dependencies once you have `openllm` installed.
-
-```bash
-pip install openllm
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a Dolly-v2 server and send a request to it.
-
-```bash
-openllm start databricks/dolly-v2-3b
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following Dolly-v2 models by using `--model-id`.
-
-- [databricks/dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b)
-- [databricks/dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b)
-- [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
-- Any other models that strictly follows the [GPTNeoXForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_neox#transformers.GPTNeoXForCausalLM) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start databricks/dolly-v2-3b --backend pt
- ```
-
-- vLLM:
-
- ```bash
- openllm start databricks/dolly-v2-3b --backend vllm
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
-Falcon
-
-### Installation
-
-To run Falcon models with OpenLLM, you need to install the `falcon` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[falcon]"
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a Falcon server and send a request to it.
-
-```bash
-openllm start tiiuae/falcon-7b
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following Falcon models by using `--model-id`.
-
-- [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
-- [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
-- [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
-- [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct)
-- Any other models that strictly follows the [FalconForCausalLM](https://falconllm.tii.ae/) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start tiiuae/falcon-7b --backend pt
- ```
-
-- vLLM:
-
- ```bash
- pip install "openllm[falcon, vllm]"
- openllm start tiiuae/falcon-7b --backend vllm
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
-Flan-T5
-
-### Installation
-
-To run Flan-T5 models with OpenLLM, you need to install the `flan-t5` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[flan-t5]"
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a Flan-T5 server and send a request to it.
-
-```bash
-openllm start google/flan-t5-large
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following Flan-T5 models by using `--model-id`.
-
-- [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
-- [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)
-- [google/flan-t5-large](https://huggingface.co/google/flan-t5-large)
-- [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl)
-- [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl)
-- Any other models that strictly follows the [T5ForConditionalGeneration](https://huggingface.co/docs/transformers/main/model_doc/t5#transformers.T5ForConditionalGeneration) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start google/flan-t5-large --backend pt
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
-GPT-NeoX
-
-### Installation
-
-GPT-NeoX models do not require you to install any model-specific dependencies once you have `openllm` installed.
-
-```bash
-pip install openllm
-```
-
-### Quickstart
-
-Run the following commands to quickly spin up a GPT-NeoX server and send a request to it.
-
-```bash
-openllm start eleutherai/gpt-neox-20b
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'What are large language models?'
-```
-
-### Supported models
-
-You can specify any of the following GPT-NeoX models by using `--model-id`.
-
-- [eleutherai/gpt-neox-20b](https://huggingface.co/eleutherai/gpt-neox-20b)
-- Any other models that strictly follows the [GPTNeoXForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_neox#transformers.GPTNeoXForCausalLM) architecture
-
-### Supported backends
-
-- PyTorch (Default):
-
- ```bash
- openllm start eleutherai/gpt-neox-20b --backend pt
- ```
-
-- vLLM:
-
- ```bash
- openllm start eleutherai/gpt-neox-20b --backend vllm
- ```
-
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
-
-
-
-
<summary>MPT</summary>
-### Installation
-
-To run MPT models with OpenLLM, you need to install the `mpt` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[mpt]"
-```
### Quickstart
-Run the following commands to quickly spin up a MPT server and send a request to it.
+Run the following command to quickly spin up an MPT server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start mosaicml/mpt-7b-instruct
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start mosaicml/mpt-7b-chat
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any MPT variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=mpt) to see more MPT-compatible models.
+
+
+
### Supported models
-You can specify any of the following MPT models by using `--model-id`.
+You can specify any of the following MPT models via `openllm start`:
+
- [mosaicml/mpt-7b](https://huggingface.co/mosaicml/mpt-7b)
- [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct)
@@ -553,52 +728,70 @@ You can specify any of the following MPT models by using `--model-id`.
- [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b)
- [mosaicml/mpt-30b-instruct](https://huggingface.co/mosaicml/mpt-30b-instruct)
- [mosaicml/mpt-30b-chat](https://huggingface.co/mosaicml/mpt-30b-chat)
-- Any other models that strictly follows the [MPTForCausalLM](https://huggingface.co/mosaicml) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available and falls back to PyTorch otherwise.
+
+
+
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend to run the model. If you have access to a GPU, always use `--backend vllm`.
+
- ```bash
- openllm start mosaicml/mpt-7b-chat --backend pt
- ```
- vLLM (Recommended):
- ```bash
- pip install "openllm[mpt, vllm]"
- openllm start mosaicml/mpt-7b-chat --backend vllm
- ```
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+To install vLLM, run `pip install "openllm[vllm]"`
+
+```bash
+openllm start mosaicml/mpt-7b --backend vllm
+```
+
+
+> **Important:** Using vLLM requires a GPU with compute capability 8.0 or newer to get the best serving performance. vLLM is the recommended backend for all production serving use cases.
+
+
+
+> **Note:** Currently, adapters are yet to be supported with vLLM.
+
+
+- PyTorch:
+
+
+```bash
+openllm start mosaicml/mpt-7b --backend pt
+```
+
<summary>OPT</summary>
-### Installation
-
-To run OPT models with OpenLLM, you need to install the `opt` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[opt]"
-```
### Quickstart
-Run the following commands to quickly spin up an OPT server and send a request to it.
+Run the following command to quickly spin up an OPT server:
+
+```bash
+openllm start facebook/opt-1.3b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start facebook/opt-2.7b
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any OPT variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=opt) to see more OPT-compatible models.
+
+
+
### Supported models
-You can specify any of the following OPT models by using `--model-id`.
+You can specify any of the following OPT models via `openllm start`:
+
- [facebook/opt-125m](https://huggingface.co/facebook/opt-125m)
- [facebook/opt-350m](https://huggingface.co/facebook/opt-350m)
@@ -606,180 +799,249 @@ You can specify any of the following OPT models by using `--model-id`.
- [facebook/opt-2.7b](https://huggingface.co/facebook/opt-2.7b)
- [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b)
- [facebook/opt-66b](https://huggingface.co/facebook/opt-66b)
-- Any other models that strictly follows the [OPTForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/opt#transformers.OPTForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available and falls back to PyTorch otherwise.
- ```bash
- openllm start facebook/opt-2.7b --backend pt
- ```
-- vLLM:
- ```bash
- pip install "openllm[opt, vllm]"
- openllm start facebook/opt-2.7b --backend vllm
- ```
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend to run the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`
+
+```bash
+openllm start facebook/opt-125m --backend vllm
+```
+
+
+> **Important:** Using vLLM requires a GPU with compute capability 8.0 or newer to get the best serving performance. vLLM is the recommended backend for all production serving use cases.
+
+
+
+> **Note:** Currently, adapters are yet to be supported with vLLM.
+
+
+- PyTorch:
+
+
+```bash
+openllm start facebook/opt-125m --backend pt
+```
+
<summary>StableLM</summary>
-### Installation
-
-StableLM models do not require you to install any model-specific dependencies once you have `openllm` installed.
-
-```bash
-pip install openllm
-```
### Quickstart
-Run the following commands to quickly spin up a StableLM server and send a request to it.
+Run the following command to quickly spin up a StableLM server:
+
+```bash
+openllm start stabilityai/stablelm-tuned-alpha-3b
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start stabilityai/stablelm-tuned-alpha-7b
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any StableLM variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=stablelm) to see more StableLM-compatible models.
+
+
+
### Supported models
-You can specify any of the following StableLM models by using `--model-id`.
+You can specify any of the following StableLM models via `openllm start`:
+
- [stabilityai/stablelm-tuned-alpha-3b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b)
- [stabilityai/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
- [stabilityai/stablelm-base-alpha-3b](https://huggingface.co/stabilityai/stablelm-base-alpha-3b)
- [stabilityai/stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b)
-- Any other models that strictly follows the [GPTNeoXForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_neox#transformers.GPTNeoXForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available and falls back to PyTorch otherwise.
- ```bash
- openllm start stabilityai/stablelm-tuned-alpha-7b --backend pt
- ```
-- vLLM:
- ```bash
- openllm start stabilityai/stablelm-tuned-alpha-7b --backend vllm
- ```
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend to run the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`
+
+```bash
+openllm start stabilityai/stablelm-tuned-alpha-3b --backend vllm
+```
+
+
+> **Important:** Using vLLM requires a GPU with compute capability 8.0 or newer to get the best serving performance. vLLM is the recommended backend for all production serving use cases.
+
+
+
+> **Note:** Currently, adapters are yet to be supported with vLLM.
+
+
+- PyTorch:
+
+
+```bash
+openllm start stabilityai/stablelm-tuned-alpha-3b --backend pt
+```
+
<summary>StarCoder</summary>
-### Installation
-
-To run StarCoder models with OpenLLM, you need to install the `starcoder` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[starcoder]"
-```
### Quickstart
-Run the following commands to quickly spin up a StarCoder server and send a request to it.
+Run the following command to quickly spin up a StarCoder server:
```bash
openllm start bigcode/starcoder
+```
+In a different terminal, run the following command to interact with the server:
+
+```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any StarCoder variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=starcoder) to see more StarCoder-compatible models.
+
+
+
### Supported models
-You can specify any of the following StarCoder models by using `--model-id`.
+You can specify any of the following StarCoder models via `openllm start`:
+
- [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
- [bigcode/starcoderbase](https://huggingface.co/bigcode/starcoderbase)
-- Any other models that strictly follows the [GPTBigCodeForCausalLM](https://huggingface.co/docs/transformers/main/model_doc/gpt_bigcode#transformers.GPTBigCodeForCausalLM) architecture
-
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available and falls back to PyTorch otherwise.
- ```bash
- openllm start bigcode/starcoder --backend pt
- ```
-- vLLM:
- ```bash
- pip install "openllm[startcoder, vllm]"
- openllm start bigcode/starcoder --backend vllm
- ```
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend to run the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`
+
+```bash
+openllm start bigcode/starcoder --backend vllm
+```
+
+
+> **Important:** Using vLLM requires a GPU with compute capability 8.0 or newer to get the best serving performance. vLLM is the recommended backend for all production serving use cases.
+
+
+
+> **Note:** Currently, adapters are yet to be supported with vLLM.
+
+
+- PyTorch:
+
+
+```bash
+openllm start bigcode/starcoder --backend pt
+```
-Baichuan
-### Installation
+<summary>Yi</summary>
-To run Baichuan models with OpenLLM, you need to install the `baichuan` dependency as it is not installed by default.
-
-```bash
-pip install "openllm[baichuan]"
-```
### Quickstart
-Run the following commands to quickly spin up a Baichuan server and send a request to it.
+Run the following command to quickly spin up a Yi server:
+
+```bash
+TRUST_REMOTE_CODE=True openllm start 01-ai/Yi-6B
+```
+In a different terminal, run the following command to interact with the server:
```bash
-openllm start baichuan-inc/baichuan-13b-base
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'
```
+
+> **Note:** Any Yi variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=yi) to see more Yi-compatible models.
+
+
+
### Supported models
-You can specify any of the following Baichuan models by using `--model-id`.
+You can specify any of the following Yi models via `openllm start`:
-- [baichuan-inc/baichuan-7b](https://huggingface.co/baichuan-inc/baichuan-7b)
-- [baichuan-inc/baichuan-13b-base](https://huggingface.co/baichuan-inc/baichuan-13b-base)
-- [baichuan-inc/baichuan-13b-chat](https://huggingface.co/baichuan-inc/baichuan-13b-chat)
-- [fireballoon/baichuan-vicuna-chinese-7b](https://huggingface.co/fireballoon/baichuan-vicuna-chinese-7b)
-- [fireballoon/baichuan-vicuna-7b](https://huggingface.co/fireballoon/baichuan-vicuna-7b)
-- [hiyouga/baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft)
-- Any other models that strictly follows the [BaiChuanForCausalLM](https://github.com/baichuan-inc/Baichuan-7B) architecture
+- [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B)
+- [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)
+- [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K)
+- [01-ai/Yi-34B-200K](https://huggingface.co/01-ai/Yi-34B-200K)
### Supported backends
-- PyTorch (Default):
+OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available and falls back to PyTorch otherwise.
- ```bash
- openllm start baichuan-inc/baichuan-13b-base --backend pt
- ```
-- vLLM:
- ```bash
- pip install "openllm[baichuan, vllm]"
- openllm start baichuan-inc/baichuan-13b-base --backend vllm
- ```
+> **Important:** We recommend explicitly specifying `--backend` to choose the desired backend to run the model. If you have access to a GPU, always use `--backend vllm`.
-> [!NOTE]
-> Currently when using the vLLM backend, adapters is yet to be supported.
+
+
+- vLLM (Recommended):
+
+
+To install vLLM, run `pip install "openllm[vllm]"`
+
+```bash
+openllm start 01-ai/Yi-6B --backend vllm
+```
+
+
+> **Important:** Using vLLM requires a GPU with compute capability 8.0 or newer to get the best serving performance. vLLM is the recommended backend for all production serving use cases.
+
+
+
+> **Note:** Currently, adapters are yet to be supported with vLLM.
+
+
+- PyTorch:
+
+
+```bash
+openllm start 01-ai/Yi-6B --backend pt
+```
+
+
More models will be integrated with OpenLLM and we welcome your contributions if you want to incorporate your custom LLMs into the ecosystem. Check out [Adding a New Model Guide](https://github.com/bentoml/OpenLLM/blob/main/ADDING_NEW_MODEL.md) to learn more.
## 💻 Run your model on multiple GPUs
@@ -835,7 +1097,7 @@ With PyTorch backend, OpenLLM supports `int8`, `int4`, `gptq`
For using int8 and int4 quantization through `bitsandbytes`, you can use the following command:
```bash
-openllm start opt --quantize int8
+openllm start facebook/opt-6.7b --quantize int8
```
To run inference with `gptq`, simply pass `--quantize gptq`:
@@ -856,7 +1118,7 @@ With vLLM backend, OpenLLM supports `awq`, `squeezellm`
To run inference with `awq`, simply pass `--quantize awq`:
```bash
-openllm start mistral --model-id TheBloke/zephyr-7B-alpha-AWQ --quantize awq
+openllm start TheBloke/zephyr-7B-alpha-AWQ --quantize awq
```
To run inference with `squeezellm`, simply pass `--quantize squeezellm`:
@@ -875,19 +1137,19 @@ openllm start squeeze-ai-lab/sq-llama-2-7b-w4-s0 --quantize squeezellm --seriali
With OpenLLM, you can take advantage of the fine-tuning feature by serving models with any PEFT-compatible layers using the `--adapter-id` option. For example:
```bash
-openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6-7b-quotes:default
+openllm start facebook/opt-6.7b --adapter-id aarnphm/opt-6-7b-quotes:default
```
OpenLLM also provides flexibility by supporting adapters from custom file paths:
```bash
-openllm start opt --model-id facebook/opt-6.7b --adapter-id /path/to/adapters:local_adapter
+openllm start facebook/opt-6.7b --adapter-id /path/to/adapters:local_adapter
```
To use multiple adapters, use the following format:
```bash
-openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora:default --adapter-id aarnphm/opt-6.7b-french:french_lora
+openllm start facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora:default --adapter-id aarnphm/opt-6.7b-french:french_lora
```
By default, all adapters will be injected into the models during startup. Adapters can be specified per request via `adapter_name`:
@@ -927,15 +1189,6 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .
> [!IMPORTANT]
> Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.
-## 🥅 Playground and Chat UI
-
-The following UIs are currently available for OpenLLM:
-
-| UI | Owner | Type | Progress |
-| ---------------------------------------------------------------------------------- | -------------------------------------------- | -------------------- | -------- |
-| [Clojure](https://github.com/bentoml/OpenLLM/blob/main/external/clojure/README.md) | [@GutZuFusss](https://github.com/GutZuFusss) | Community-maintained | 🔧 |
-| TS | BentoML Team | | 🚧 |
-
## 🐍 Python SDK
Each LLM can be instantiated with `openllm.LLM`:
@@ -978,9 +1231,30 @@ This method is easy to use for one-shot generation use case, but merely served a
OpenLLM is not just a standalone product; it's a building block designed to
integrate with other powerful tools easily. We currently offer integration with
[BentoML](https://github.com/bentoml/BentoML),
+[OpenAI-compatible endpoints](https://platform.openai.com/docs/api-reference/completions/object),
[LangChain](https://github.com/hwchase17/langchain), and
[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).
+### OpenAI Compatible Endpoints
+
+OpenLLM's server can be used as a drop-in replacement for OpenAI's API. Simply
+set `base_url` to your OpenLLM endpoint followed by `/v1` and you are good to go:
+
+```python
+import openai
+
+client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # Here the server is running on localhost:3000
+model = client.models.list().data[0].id  # Use the first model served by this endpoint
+
+completions = client.completions.create(
+  prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=False
+)
+```
+
+The compatible API supports the `/completions`, `/chat/completions`, and `/models` endpoints.
+
+> [!NOTE]
+> You can find out OpenAI example clients under the
+> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.
+
### BentoML
OpenLLM LLM can be integrated as a
diff --git a/tools/update-readme.py b/tools/update-readme.py
new file mode 100755
index 00000000..6a57306e
--- /dev/null
+++ b/tools/update-readme.py
@@ -0,0 +1,109 @@
+#!/usr/bin/env python3
+import os
+import shutil
+import sys
+
+import tomlkit
+
+START_COMMENT = f'<!-- {os.path.basename(__file__)}: start -->\n'
+END_COMMENT = f'<!-- {os.path.basename(__file__)}: stop -->\n'
+
+ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+sys.path.insert(0, os.path.join(ROOT, 'openllm-core', 'src'))
+from openllm_core.config import CONFIG_MAPPING
+
+
+def markdown_noteblock(text: str):
+  return ['\n', f'> **Note:** {text}\n']
+
+
+def markdown_importantblock(text: str):
+  return ['\n', f'> **Important:** {text}\n']
+
+
+def main() -> int:
+  with open(os.path.join(ROOT, 'openllm-python', 'pyproject.toml'), 'r') as f:
+    deps = tomlkit.parse(f.read()).value['project']['optional-dependencies']
+  with open(os.path.join(ROOT, 'README.md'), 'r') as f:
+    readme = f.readlines()
+
+  start_index, stop_index = readme.index(START_COMMENT), readme.index(END_COMMENT)
+
+  content = []
+
+  for it in CONFIG_MAPPING.values():
+    it = it()
+    details_block = ['<details>\n']
+    architecture_name = it.__class__.__name__[:-6]
+    details_block.extend(
+      [
+        f'<summary>{architecture_name}</summary>\n\n',
+        '### Quickstart\n',
+        f'Run the following command to quickly spin up a {architecture_name} server:\n',
+        f"""\
+```bash
+{'' if not it['trust_remote_code'] else 'TRUST_REMOTE_CODE=True '}openllm start {it['default_id']}
+```""",
+        'In a different terminal, run the following command to interact with the server:\n',
+        """\
+```bash
+export OPENLLM_ENDPOINT=http://localhost:3000
+openllm query 'What are large language models?'
+```""",
+        *markdown_noteblock(
+          f"Any {architecture_name} variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search={it['model_name']}) to see more {architecture_name}-compatible models.\n"
+        ),
+        '\n### Supported models\n',
+        f'You can specify any of the following {architecture_name} models via `openllm start`:\n\n',
+      ]
+    )
+    list_ids = [f'- [{model_id}](https://huggingface.co/{model_id})' for model_id in it['model_ids']]
+    details_block.extend(list_ids)
+    details_block.extend(
+      [
+        '### Supported backends\n',
+        'OpenLLM supports both vLLM and PyTorch backends. It uses vLLM by default if vLLM is available and falls back to PyTorch otherwise.\n',
+        *markdown_importantblock(
+          'We recommend explicitly specifying `--backend` to choose the desired backend to run the model. If you have access to a GPU, always use `--backend vllm`.\n'
+        ),
+      ]
+    )
+    if 'vllm' in it['backend']:
+      details_block.extend(
+        [
+          '\n- vLLM (Recommended):\n\n',
+          'To install vLLM, run `pip install "openllm[vllm]"`\n',
+          f"""\
+```bash
+openllm start {it['model_ids'][0]} --backend vllm
+```""",
+          *markdown_importantblock(
+            'Using vLLM requires a GPU with compute capability 8.0 or newer to get the best serving performance. vLLM is the recommended backend for all production serving use cases.'
+          ),
+          *markdown_noteblock('Currently, adapters are yet to be supported with vLLM.'),
+        ]
+      )
+    if 'pt' in it['backend']:
+      details_block.extend(
+        [
+          '\n- PyTorch:\n\n',
+          f"""\
+```bash
+openllm start {it['model_ids'][0]} --backend pt
+```""",
+        ]
+      )
+
+    details_block.append('\n</details>\n\n')
+    content.append('\n'.join(details_block))
+
+  readme = readme[:start_index] + [START_COMMENT] + content + [END_COMMENT] + readme[stop_index + 1 :]
+  with open(os.path.join(ROOT, 'README.md'), 'w') as f:
+    f.writelines(readme)
+
+  shutil.copyfile(os.path.join(ROOT, 'README.md'), os.path.join(ROOT, 'openllm-python', 'README.md'))
+  return 0
+
+
+if __name__ == '__main__':
+  raise SystemExit(main())