![Banner for OpenLLM](/.github/assets/main-banner.png)
# 🦾 OpenLLM: Self-Hosting LLMs Made Easy
## 📖 Introduction

OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment.

- 🚂 Support for a wide range of open-source LLMs, including models fine-tuned with your own data
- ⛓️ OpenAI-compatible API endpoints for a seamless transition from your LLM app to open-source LLMs
- 🔥 State-of-the-art serving and inference performance
- 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com)

![Gif showing OpenLLM Intro](/.github/assets/output.gif)
## 💾 TL;DR

For starters, here is a quick way to try out OpenLLM:

### Jupyter Notebooks

Try this [OpenLLM tutorial in Google Colab](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb).

## 🏃 Get started

The following provides instructions on how to get started with OpenLLM locally.

### Prerequisites

Make sure you have Python 3.9 (or later) and `pip` installed. We highly recommend using a [virtual environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.

### Install OpenLLM

Install OpenLLM using `pip`:

```bash
pip install openllm
```

To verify the installation, run:

```bash
$ openllm -h
```

### Start an LLM server

OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start a [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) server, run the following:

```bash
openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
```

To interact with the server, you can visit the web UI at [http://0.0.0.0:3000/](http://0.0.0.0:3000/) or send a request using `curl` (see the sketch at the end of this section). You can also use OpenLLM's built-in Python client to interact with the server:

```python
import openllm

client = openllm.HTTPClient('http://localhost:3000')
client.generate('Explain to me the difference between "further" and "farther"')
```

OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example:

```bash
openllm start <model_id> --<options>
```
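Whichever model you serve, the server exposes the same OpenAI-compatible HTTP API, so the `curl` request mentioned above might look like the following. This is a minimal sketch assuming the Phi-3 server started in this section and the standard OpenAI chat-completions request shape; set `model` to whatever your server is running:

```bash
# Minimal sketch: call the OpenAI-compatible chat completions route directly.
# Assumes the Phi-3 server started above; adjust "model" to the model you serve.
curl -X POST http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "messages": [{"role": "user", "content": "What are large language models?"}],
    "max_tokens": 128
  }'
```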
## 🧩 Supported models

OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.

<details>
<summary>Baichuan</summary>

### Quickstart

Run the following command to quickly spin up a Baichuan server:

```bash
openllm start baichuan-inc/baichuan-7b --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Baichuan variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=baichuan) to see more Baichuan-compatible models.

### Supported models

You can specify any of the following Baichuan models via `openllm start`:

- [baichuan-inc/baichuan2-7b-base](https://huggingface.co/baichuan-inc/baichuan2-7b-base)
- [baichuan-inc/baichuan2-7b-chat](https://huggingface.co/baichuan-inc/baichuan2-7b-chat)
- [baichuan-inc/baichuan2-13b-base](https://huggingface.co/baichuan-inc/baichuan2-13b-base)
- [baichuan-inc/baichuan2-13b-chat](https://huggingface.co/baichuan-inc/baichuan2-13b-chat)

</details>
<details>
<summary>ChatGLM</summary>

### Quickstart

Run the following command to quickly spin up a ChatGLM server:

```bash
openllm start thudm/chatglm-6b --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any ChatGLM variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=chatglm) to see more ChatGLM-compatible models.

### Supported models

You can specify any of the following ChatGLM models via `openllm start`:

- [thudm/chatglm-6b](https://huggingface.co/thudm/chatglm-6b)
- [thudm/chatglm-6b-int8](https://huggingface.co/thudm/chatglm-6b-int8)
- [thudm/chatglm-6b-int4](https://huggingface.co/thudm/chatglm-6b-int4)
- [thudm/chatglm2-6b](https://huggingface.co/thudm/chatglm2-6b)
- [thudm/chatglm2-6b-int4](https://huggingface.co/thudm/chatglm2-6b-int4)
- [thudm/chatglm3-6b](https://huggingface.co/thudm/chatglm3-6b)

</details>
<details>
<summary>Cohere</summary>

### Quickstart

Run the following command to quickly spin up a Cohere server:

```bash
openllm start CohereForAI/c4ai-command-r-plus --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Cohere variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=commandr) to see more Cohere-compatible models.

### Supported models

You can specify any of the following Cohere models via `openllm start`:

- [CohereForAI/c4ai-command-r-plus](https://huggingface.co/CohereForAI/c4ai-command-r-plus)
- [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01)

</details>
<details>
<summary>Dbrx</summary>

### Quickstart

Run the following command to quickly spin up a Dbrx server:

```bash
openllm start databricks/dbrx-instruct --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Dbrx variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=dbrx) to see more Dbrx-compatible models.

### Supported models

You can specify any of the following Dbrx models via `openllm start`:

- [databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct)
- [databricks/dbrx-base](https://huggingface.co/databricks/dbrx-base)

</details>
<details>
<summary>DollyV2</summary>

### Quickstart

Run the following command to quickly spin up a DollyV2 server:

```bash
openllm start databricks/dolly-v2-3b --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any DollyV2 variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=dolly_v2) to see more DollyV2-compatible models.

### Supported models

You can specify any of the following DollyV2 models via `openllm start`:

- [databricks/dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b)
- [databricks/dolly-v2-7b](https://huggingface.co/databricks/dolly-v2-7b)
- [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)

</details>
<details>
<summary>Falcon</summary>

### Quickstart

Run the following command to quickly spin up a Falcon server:

```bash
openllm start tiiuae/falcon-7b --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Falcon variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=falcon) to see more Falcon-compatible models.

### Supported models

You can specify any of the following Falcon models via `openllm start`:

- [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b)
- [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
- [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
- [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct)

</details>
<details>
<summary>Gemma</summary>

### Quickstart

Run the following command to quickly spin up a Gemma server:

```bash
openllm start google/gemma-7b --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Gemma variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=gemma) to see more Gemma-compatible models.

### Supported models

You can specify any of the following Gemma models via `openllm start`:

- [google/gemma-7b](https://huggingface.co/google/gemma-7b)
- [google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it)
- [google/gemma-2b](https://huggingface.co/google/gemma-2b)
- [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it)

</details>
<details>
<summary>GPTNeoX</summary>

### Quickstart

Run the following command to quickly spin up a GPTNeoX server:

```bash
openllm start eleutherai/gpt-neox-20b --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any GPTNeoX variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=gpt_neox) to see more GPTNeoX-compatible models.

### Supported models

You can specify any of the following GPTNeoX models via `openllm start`:

- [eleutherai/gpt-neox-20b](https://huggingface.co/eleutherai/gpt-neox-20b)

</details>
<details>
<summary>Llama</summary>

### Quickstart

Run the following command to quickly spin up a Llama server:

```bash
openllm start NousResearch/llama-2-7b-hf --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Llama variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=llama) to see more Llama-compatible models.

### Supported models

You can specify any of the following Llama models via `openllm start`:

- [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
- [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf)
- [meta-llama/Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [NousResearch/llama-2-70b-chat-hf](https://huggingface.co/NousResearch/llama-2-70b-chat-hf)
- [NousResearch/llama-2-13b-chat-hf](https://huggingface.co/NousResearch/llama-2-13b-chat-hf)
- [NousResearch/llama-2-7b-chat-hf](https://huggingface.co/NousResearch/llama-2-7b-chat-hf)
- [NousResearch/llama-2-70b-hf](https://huggingface.co/NousResearch/llama-2-70b-hf)
- [NousResearch/llama-2-13b-hf](https://huggingface.co/NousResearch/llama-2-13b-hf)
- [NousResearch/llama-2-7b-hf](https://huggingface.co/NousResearch/llama-2-7b-hf)

</details>
<details>
<summary>Mistral</summary>

### Quickstart

Run the following command to quickly spin up a Mistral server:

```bash
openllm start mistralai/Mistral-7B-Instruct-v0.1 --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Mistral variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=mistral) to see more Mistral-compatible models.

### Supported models

You can specify any of the following Mistral models via `openllm start`:

- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
- [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
- [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)

</details>
<details>
<summary>Mixtral</summary>

### Quickstart

Run the following command to quickly spin up a Mixtral server:

```bash
openllm start mistralai/Mixtral-8x7B-Instruct-v0.1 --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Mixtral variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=mixtral) to see more Mixtral-compatible models.

### Supported models

You can specify any of the following Mixtral models via `openllm start`:

- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)

</details>
<details>
<summary>MPT</summary>

### Quickstart

Run the following command to quickly spin up an MPT server:

```bash
openllm start mosaicml/mpt-7b-instruct --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any MPT variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=mpt) to see more MPT-compatible models.

### Supported models

You can specify any of the following MPT models via `openllm start`:

- [mosaicml/mpt-7b](https://huggingface.co/mosaicml/mpt-7b)
- [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct)
- [mosaicml/mpt-7b-chat](https://huggingface.co/mosaicml/mpt-7b-chat)
- [mosaicml/mpt-7b-storywriter](https://huggingface.co/mosaicml/mpt-7b-storywriter)
- [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b)
- [mosaicml/mpt-30b-instruct](https://huggingface.co/mosaicml/mpt-30b-instruct)
- [mosaicml/mpt-30b-chat](https://huggingface.co/mosaicml/mpt-30b-chat)

</details>
<details>
<summary>OPT</summary>

### Quickstart

Run the following command to quickly spin up an OPT server:

```bash
openllm start facebook/opt-1.3b
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any OPT variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=opt) to see more OPT-compatible models.

### Supported models

You can specify any of the following OPT models via `openllm start`:

- [facebook/opt-125m](https://huggingface.co/facebook/opt-125m)
- [facebook/opt-350m](https://huggingface.co/facebook/opt-350m)
- [facebook/opt-1.3b](https://huggingface.co/facebook/opt-1.3b)
- [facebook/opt-2.7b](https://huggingface.co/facebook/opt-2.7b)
- [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b)
- [facebook/opt-66b](https://huggingface.co/facebook/opt-66b)

</details>
<details>
<summary>Phi</summary>

### Quickstart

Run the following command to quickly spin up a Phi server:

```bash
openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Phi variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=phi) to see more Phi-compatible models.

### Supported models

You can specify any of the following Phi models via `openllm start`:

- [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)
- [microsoft/Phi-3-small-8k-instruct](https://huggingface.co/microsoft/Phi-3-small-8k-instruct)
- [microsoft/Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)
- [microsoft/Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct)
- [microsoft/Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)

</details>
<details>
<summary>Qwen</summary>

### Quickstart

Run the following command to quickly spin up a Qwen server:

```bash
openllm start qwen/Qwen-7B-Chat --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Qwen variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=qwen) to see more Qwen-compatible models.

### Supported models

You can specify any of the following Qwen models via `openllm start`:

- [qwen/Qwen-7B-Chat](https://huggingface.co/qwen/Qwen-7B-Chat)
- [qwen/Qwen-7B-Chat-Int8](https://huggingface.co/qwen/Qwen-7B-Chat-Int8)
- [qwen/Qwen-7B-Chat-Int4](https://huggingface.co/qwen/Qwen-7B-Chat-Int4)
- [qwen/Qwen-14B-Chat](https://huggingface.co/qwen/Qwen-14B-Chat)
- [qwen/Qwen-14B-Chat-Int8](https://huggingface.co/qwen/Qwen-14B-Chat-Int8)
- [qwen/Qwen-14B-Chat-Int4](https://huggingface.co/qwen/Qwen-14B-Chat-Int4)

</details>
<details>
<summary>StableLM</summary>

### Quickstart

Run the following command to quickly spin up a StableLM server:

```bash
openllm start stabilityai/stablelm-tuned-alpha-3b --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any StableLM variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=stablelm) to see more StableLM-compatible models.

### Supported models

You can specify any of the following StableLM models via `openllm start`:

- [stabilityai/stablelm-tuned-alpha-3b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b)
- [stabilityai/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
- [stabilityai/stablelm-base-alpha-3b](https://huggingface.co/stabilityai/stablelm-base-alpha-3b)
- [stabilityai/stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b)

</details>
<details>
<summary>StarCoder</summary>

### Quickstart

Run the following command to quickly spin up a StarCoder server:

```bash
openllm start bigcode/starcoder --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any StarCoder variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=starcoder) to see more StarCoder-compatible models.

### Supported models

You can specify any of the following StarCoder models via `openllm start`:

- [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)
- [bigcode/starcoderbase](https://huggingface.co/bigcode/starcoderbase)

</details>
<details>
<summary>Yi</summary>

### Quickstart

Run the following command to quickly spin up a Yi server:

```bash
openllm start 01-ai/Yi-6B --trust-remote-code
```

You can run the following code in a different terminal to interact with the server:

```python
import openllm_client

client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
```

> **Note:** Any Yi variants can be deployed with OpenLLM. Visit the [HuggingFace Model Hub](https://huggingface.co/models?sort=trending&search=yi) to see more Yi-compatible models.

### Supported models

You can specify any of the following Yi models via `openllm start`:

- [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B)
- [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)
- [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K)
- [01-ai/Yi-34B-200K](https://huggingface.co/01-ai/Yi-34B-200K)

</details>

More models will be integrated with OpenLLM, and we welcome contributions if you want to incorporate your custom LLMs into the ecosystem. Check out the [Adding a New Model Guide](https://github.com/bentoml/OpenLLM/blob/main/ADDING_NEW_MODEL.md) to learn more.

## 📐 Quantization

Quantization is a technique to reduce the storage and computation requirements of machine learning models, particularly during inference. By approximating floating-point numbers with integers (quantized values), quantization allows for faster computation, a reduced memory footprint, and can make it feasible to deploy large models on resource-constrained devices.

OpenLLM supports the following quantization techniques:

- [AWQ: Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978)
- [GPTQ: Accurate Post-Training Quantization](https://arxiv.org/abs/2210.17323)
- [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/abs/2306.07629)

> [!NOTE]
> Make sure to use pre-quantized model weights when serving with `openllm start`, as in the sketch below.
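For example, serving community-published, pre-quantized AWQ weights might look like the following. This is a hypothetical invocation rather than a verified recipe: both the model repository and the `--quantize` flag here are illustrative assumptions, so run `openllm start -h` to check the options your version actually supports.

```bash
# Hypothetical sketch: serve pre-quantized AWQ weights.
# The repository name and the --quantize flag are assumptions; verify with `openllm start -h`.
openllm start TheBloke/Mistral-7B-Instruct-v0.1-AWQ --quantize awq
```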
## ⚙️ Integrations

OpenLLM is not just a standalone product; it's a building block designed to
integrate easily with other powerful tools. We currently offer integrations with
[OpenAI's compatible endpoints](https://platform.openai.com/docs/api-reference/completions/object),
[LlamaIndex](https://www.llamaindex.ai/), and
[LangChain](https://github.com/hwchase17/langchain).

### OpenAI Compatible Endpoints

An OpenLLM server can be used as a drop-in replacement for OpenAI's API. Simply
point `base_url` at your LLM endpoint (`<llm-endpoint>/v1`) and you are good to go:

```python
import openai

client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # Here the server is running on 0.0.0.0:3000

completions = client.chat.completions.create(
    messages=[{'role': 'user', 'content': 'Write me a tag line for an ice cream shop.'}],
    model='microsoft/Phi-3-mini-4k-instruct',  # set this to the model your server is serving
    max_tokens=64,
)
```

The compatible endpoints support `/chat/completions` and `/models`.
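For instance, you can check which model the server exposes by hitting the `/models` route; a minimal sketch against the server used above:

```bash
# List the models exposed by the OpenAI-compatible endpoint.
curl http://localhost:3000/v1/models
```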
> [!NOTE]
> You can find example OpenAI clients in the
> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.

### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/)

You can use `llama_index.llms.openllm.OpenLLMAPI` to interact with a running LLM server:

```python
from llama_index.llms.openllm import OpenLLMAPI

llm = OpenLLMAPI(address='http://localhost:3000')  # point the client at your running OpenLLM server
```

> [!NOTE]
> All synchronous and asynchronous APIs of `llama_index.llms.openllm.OpenLLMAPI` are supported.
> Make sure to install `llama-index-llms-openllm` to use this class.

### [LangChain](https://python.langchain.com/docs/integrations/llms/openllm/)

Spin up an OpenLLM server, and connect to it by specifying its URL:

```python
from langchain.llms import OpenLLMAPI

llm = OpenLLMAPI(server_url='http://44.23.123.1:3000')
llm.invoke('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')

# streaming
for it in llm.stream('What is the difference between a duck and a goose? And why there are so many Goose in Canada?'):
    print(it, flush=True, end='')

# async context
await llm.ainvoke('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')

# async streaming
async for it in llm.astream('What is the difference between a duck and a goose? And why there are so many Goose in Canada?'):
    print(it, flush=True, end='')
```

## 🚀 Deploying models to production

There are several ways to deploy your LLMs:

### 🐳 Docker container

1. **Building a Bento**: With OpenLLM, you can easily build a Bento for a
   specific model, like `mistralai/Mistral-7B-Instruct-v0.1`, using the `build` command:

   ```bash
   openllm build mistralai/Mistral-7B-Instruct-v0.1
   ```

   A
   [Bento](https://docs.bentoml.com/en/latest/concepts/bento.html#what-is-a-bento),
   in BentoML, is the unit of distribution. It packages your program's source
   code, models, files, artefacts, and dependencies.

2. **Containerize your Bento**

   ```bash
   bentoml containerize <name:version>
   ```

   This generates an OCI-compatible Docker image that can be deployed anywhere
   Docker runs. For the best scalability and reliability of your LLM service in
   production, we recommend deploying with BentoCloud.

### ☁️ BentoCloud

Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/), the inference platform
for fast-moving AI teams.

1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/)

2. **Log in to your BentoCloud account:**

   ```bash
   bentoml cloud login --api-token <your-api-token> --endpoint <bento-cloud-endpoint>
   ```

   > [!NOTE]
   > Replace `<your-api-token>` and `<bento-cloud-endpoint>` with your
   > specific API token and the BentoCloud endpoint respectively.

3. **Building a Bento**: With OpenLLM, you can easily build a Bento for a
   specific model, such as `mistralai/Mistral-7B-Instruct-v0.1`:

   ```bash
   openllm build mistralai/Mistral-7B-Instruct-v0.1
   ```

4. **Pushing a Bento**: Push your freshly-built Bento to BentoCloud via
   the `push` command:

   ```bash
   bentoml push <name:version>
   ```

5. **Deploying a Bento**: Deploy your LLMs to BentoCloud with a single
   `bentoml deployment create` command following the
   [deployment instructions](https://docs.bentoml.com/en/latest/reference/cli.html#bentoml-deployment-create).

## 👥 Community

Engage with like-minded individuals passionate about LLMs, AI, and more on our
[Discord](https://l.bentoml.com/join-openllm-discord)!

OpenLLM is actively maintained by the BentoML team. Feel free to reach out and
join us in our pursuit to make LLMs more accessible and easy to use 👉
[Join our Slack community!](https://l.bentoml.com/join-slack)

## 🎁 Contributing

We welcome contributions! If you're interested in enhancing OpenLLM's
capabilities or have any questions, don't hesitate to reach out in our
[Discord channel](https://l.bentoml.com/join-openllm-discord).

Check out our
[Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md)
if you wish to contribute to OpenLLM's codebase.

## 📔 Citation

If you use OpenLLM in your research, we provide a [citation](./CITATION.cff) to use:

```bibtex
@software{Pham_OpenLLM_Operating_LLMs_2023,
  author = {Pham, Aaron and Yang, Chaoyu and Sheng, Sean and Zhao, Shenyang and Lee, Sauyon and Jiang, Bo and Dong, Fog and Guan, Xipeng and Ming, Frost},
  license = {Apache-2.0},
  month = jun,
  title = {{OpenLLM: Operating LLMs in production}},
  url = {https://github.com/bentoml/OpenLLM},
  year = {2023}
}
```