mirror of
https://github.com/bentoml/OpenLLM.git
synced 2026-06-11 09:59:20 -04:00
infra: prepare 0.5 releases (#996)
* chore: prepare for 0.5
  Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
* chore: update changelogs
  Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
* chore: fix to lowest python version supported
  Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
* chore: update scripts
  Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
---------
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
125
openllm-python/README.md
generated
@@ -23,13 +23,11 @@

OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment.


- 🚂 Support a wide range of open-source LLMs including LLMs fine-tuned with your own data
- ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs
- 🔥 State-of-the-art serving and inference performance
- 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com)


<!-- hatch-fancy-pypi-readme intro stop -->

![Gif showing OpenLLM Intro](/assets/output.gif)
@@ -46,29 +44,13 @@ For starter, we provide two ways to quickly try out OpenLLM:

Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb).

### Docker

We provide a docker container that helps you start running OpenLLM:

```bash
docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt
```

> [!NOTE]
> Given you have access to GPUs and have setup [nvidia-docker](https://github.com/NVIDIA/nvidia-container-toolkit), you can additionally pass in `--gpus`
> to use GPU for faster inference and optimization
>```bash
> docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm
> ```


## 🏃 Get started

The following provides instructions for how to get started with OpenLLM locally.

### Prerequisites

You have installed Python 3.8 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.
You have installed Python 3.9 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.
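
One way to confirm that an interpreter meets the raised minimum is a small version check like the sketch below. The helper name `meets_minimum` is hypothetical and not part of OpenLLM; the 3.9 floor is taken from the updated prerequisite line.

```python
import sys

# Minimum version taken from the updated prerequisite line (Python 3.9 or later).
MIN_PYTHON = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor, ...) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == '__main__':
    status = 'ok' if meets_minimum() else 'too old'
    print(f'Python {sys.version_info.major}.{sys.version_info.minor}: {status}')
```

Only the major and minor components are compared, so any 3.9 patch release passes.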

### Install OpenLLM

@@ -82,65 +64,23 @@ To verify the installation, run:

```bash
$ openllm -h

Usage: openllm [OPTIONS] COMMAND [ARGS]...

██████╗ ██████╗ ███████╗███╗ ██╗██╗ ██╗ ███╗ ███╗
██╔═══██╗██╔══██╗██╔════╝████╗ ██║██║ ██║ ████╗ ████║
██║ ██║██████╔╝█████╗ ██╔██╗ ██║██║ ██║ ██╔████╔██║
██║ ██║██╔═══╝ ██╔══╝ ██║╚██╗██║██║ ██║ ██║╚██╔╝██║
╚██████╔╝██║ ███████╗██║ ╚████║███████╗███████╗██║ ╚═╝ ██║
╚═════╝ ╚═╝ ╚══════╝╚═╝ ╚═══╝╚══════╝╚══════╝╚═╝ ╚═╝.

An open platform for operating large language models in production.
Fine-tune, serve, deploy, and monitor any LLMs with ease.

Options:
-v, --version Show the version and exit.
-h, --help Show this message and exit.

Commands:
build Package a given models into a BentoLLM.
import Setup LLM interactively.
models List all supported models.
prune Remove all saved models, (and optionally bentos) built with OpenLLM locally.
query Query a LLM interactively, from a terminal.
start Start a LLMServer for any supported LLM.

Extensions:
build-base-container Base image builder for BentoLLM.
dive-bentos Dive into a BentoLLM.
get-containerfile Return Containerfile of any given Bento.
get-prompt Get the default prompt used by OpenLLM.
list-bentos List available bentos built by OpenLLM.
list-models This is equivalent to openllm models...
playground OpenLLM Playground.
```

### Start a LLM server

OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start a [phi-2](https://huggingface.co/microsoft/phi-2) server, run the following:
OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start a [Llama 3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) server, run the following:

```bash
TRUST_REMOTE_CODE=True openllm start microsoft/phi-2
openllm start meta-llama/Meta-Llama-3-8B
```

This starts the server at [http://0.0.0.0:3000/](http://0.0.0.0:3000/). OpenLLM downloads the model to the BentoML local Model Store if it has not been registered before. To view your local models, run `bentoml models list`.

To interact with the server, you can visit the web UI at [http://0.0.0.0:3000/](http://0.0.0.0:3000/) or send a request using `curl`. You can also use OpenLLM’s built-in Python client to interact with the server:

```python
import openllm

client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Explain to me the difference between "further" and "farther"')
```

Alternatively, use the `openllm query` command to query the model:

```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'
client = openllm.HTTPClient('http://localhost:3000')
client.generate('Explain to me the difference between "further" and "farther"')
```

OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example:
@@ -155,15 +95,6 @@ openllm start <model_id> --<options>
> architecture. Use the `openllm models` command to see the complete list of supported
> models, their architectures, and their variants.

> [!IMPORTANT]
> If you are testing OpenLLM on CPU, you might want to pass in `DTYPE=float32`. By default,
> OpenLLM will set model `dtype` to `bfloat16` for the best performance.
> ```bash
> DTYPE=float32 openllm start microsoft/phi-2
> ```
> This will also applies to older GPUs. If your GPUs doesn't support `bfloat16`, then you also
> want to set `DTYPE=float16`.

## 🧩 Supported models

OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.
@@ -1097,7 +1028,6 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .
> [!IMPORTANT]
> Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.


## ⚙️ Integrations

OpenLLM is not just a standalone product; it's a building block designed to
@@ -1115,11 +1045,9 @@ specify the base_url to `llm-endpoint/v1` and you are good to go:
```python
import openai

client = openai.OpenAI(
  base_url='http://localhost:3000/v1', api_key='na'
) # Here the server is running on localhost:3000
client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na') # Here the server is running on 0.0.0.0:3000

completions = client.completions.create(
completions = client.chat.completions.create(
  prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=stream
)
```
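
In the OpenAI API, the chat endpoint takes a list of `messages` rather than a bare `prompt`. The sketch below builds such a request with only the standard library, assuming the server from the example above is listening on `http://localhost:3000` and exposes `/v1/chat/completions`. The helper names `build_chat_request` and `send` are hypothetical, and nothing is sent until `send` is called.

```python
import json
import urllib.request

# Assumed local OpenLLM server, matching the base_url used in the example above.
BASE_URL = 'http://localhost:3000/v1'


def build_chat_request(prompt, model, max_tokens=64, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        'model': model,
        # Chat endpoints take a list of role/content messages, not a `prompt` string.
        'messages': [{'role': 'user', 'content': prompt}],
        'max_tokens': max_tokens,
        'stream': False,
    }
    return urllib.request.Request(
        url=f'{base_url}/chat/completions',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json', 'Authorization': 'Bearer na'},
        method='POST',
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

With a server running, `send(build_chat_request('Write me a tag line for an ice cream shop.', model))` would return the parsed response, where `model` is whatever model id the server reports.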
@@ -1130,7 +1058,6 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod
> You can find out OpenAI example clients under the
> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.


### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/)

To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`:
@@ -1172,24 +1099,6 @@ llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http')
llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
```

### Transformers Agents

OpenLLM seamlessly integrates with
[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).

> [!WARNING]
> The Transformers Agent is still at an experimental stage. It is
> recommended to install OpenLLM with `pip install -r nightly-requirements.txt`
> to get the latest API update for HuggingFace agent.

```python
import transformers

agent = transformers.HfAgent('http://localhost:3000/hf/agent') # URL that runs the OpenLLM server

agent.run('Is the following `text` positive or negative?', text="I don't like how this models is generate inputs")
```

<!-- hatch-fancy-pypi-readme interim stop -->

![Gif showing Agent integration](/assets/agent.gif)
@@ -1280,26 +1189,6 @@ Checkout our
[Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md)
if you wish to contribute to OpenLLM's codebase.

## 🍇 Telemetry

OpenLLM collects usage data to enhance user experience and improve the product.
We only report OpenLLM's internal API calls and ensure maximum privacy by
excluding sensitive information. We will never collect user code, model data, or
stack traces. For usage tracking, check out the
[code](https://github.com/bentoml/OpenLLM/blob/main/openllm-core/src/openllm_core/utils/analytics.py).

You can opt out of usage tracking by using the `--do-not-track` CLI option:

```bash
openllm [command] --do-not-track
```

Or by setting the environment variable `OPENLLM_DO_NOT_TRACK=True`:

```bash
export OPENLLM_DO_NOT_TRACK=True
```

## 📔 Citation

If you use OpenLLM in your research, we provide a [citation](./CITATION.cff) to