infra: prepare 0.5 releases (#996)

* chore: prepare for 0.5

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

* chore: update changelogs

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

* chore: fix to lowest python version supported

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

* chore: update scripts

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

---------

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
This commit is contained in:
Aaron Pham
2024-05-23 12:50:01 -04:00
committed by GitHub
parent a410b9cfe8
commit 5e97329bcb
4 changed files with 22 additions and 237 deletions

125
openllm-python/README.md generated
View File

@@ -23,13 +23,11 @@
OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment.
- 🚂 Support a wide range of open-source LLMs including LLMs fine-tuned with your own data
- ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs
- 🔥 State-of-the-art serving and inference performance
- 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com)
<!-- hatch-fancy-pypi-readme intro stop -->
![Gif showing OpenLLM Intro](/.github/assets/output.gif)
@@ -46,29 +44,13 @@ For starter, we provide two ways to quickly try out OpenLLM:
Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb).
### Docker
We provide a docker container that helps you start running OpenLLM:
```bash
docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt
```
> [!NOTE]
> Given you have access to GPUs and have setup [nvidia-docker](https://github.com/NVIDIA/nvidia-container-toolkit), you can additionally pass in `--gpus`
> to use GPU for faster inference and optimization
>```bash
> docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm
> ```
## 🏃 Get started
The following provides instructions for how to get started with OpenLLM locally.
### Prerequisites
You have installed Python 3.8 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.
You have installed Python 3.9 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.
### Install OpenLLM
@@ -82,65 +64,23 @@ To verify the installation, run:
```bash
$ openllm -h
Usage: openllm [OPTIONS] COMMAND [ARGS]...
██████╗ ██████╗ ███████╗███╗ ██╗██╗ ██╗ ███╗ ███╗
██╔═══██╗██╔══██╗██╔════╝████╗ ██║██║ ██║ ████╗ ████║
██║ ██║██████╔╝█████╗ ██╔██╗ ██║██║ ██║ ██╔████╔██║
██║ ██║██╔═══╝ ██╔══╝ ██║╚██╗██║██║ ██║ ██║╚██╔╝██║
╚██████╔╝██║ ███████╗██║ ╚████║███████╗███████╗██║ ╚═╝ ██║
╚═════╝ ╚═╝ ╚══════╝╚═╝ ╚═══╝╚══════╝╚══════╝╚═╝ ╚═╝.
An open platform for operating large language models in production.
Fine-tune, serve, deploy, and monitor any LLMs with ease.
Options:
-v, --version Show the version and exit.
-h, --help Show this message and exit.
Commands:
build Package a given models into a BentoLLM.
import Setup LLM interactively.
models List all supported models.
prune Remove all saved models, (and optionally bentos) built with OpenLLM locally.
query Query a LLM interactively, from a terminal.
start Start a LLMServer for any supported LLM.
Extensions:
build-base-container Base image builder for BentoLLM.
dive-bentos Dive into a BentoLLM.
get-containerfile Return Containerfile of any given Bento.
get-prompt Get the default prompt used by OpenLLM.
list-bentos List available bentos built by OpenLLM.
list-models This is equivalent to openllm models...
playground OpenLLM Playground.
```
### Start a LLM server
OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start a [phi-2](https://huggingface.co/microsoft/phi-2) server, run the following:
OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start a [Llama 3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) server, run the following:
```bash
TRUST_REMOTE_CODE=True openllm start microsoft/phi-2
openllm start meta-llama/Meta-Llama-3-8B
```
This starts the server at [http://0.0.0.0:3000/](http://0.0.0.0:3000/). OpenLLM downloads the model to the BentoML local Model Store if it has not been registered before. To view your local models, run `bentoml models list`.
To interact with the server, you can visit the web UI at [http://0.0.0.0:3000/](http://0.0.0.0:3000/) or send a request using `curl`. You can also use OpenLLMs built-in Python client to interact with the server:
```python
import openllm
client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Explain to me the difference between "further" and "farther"')
```
Alternatively, use the `openllm query` command to query the model:
```bash
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'
client = openllm.HTTPClient('http://localhost:3000')
client.generate('Explain to me the difference between "further" and "farther"')
```
OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example:
@@ -155,15 +95,6 @@ openllm start <model_id> --<options>
> architecture. Use the `openllm models` command to see the complete list of supported
> models, their architectures, and their variants.
> [!IMPORTANT]
> If you are testing OpenLLM on CPU, you might want to pass in `DTYPE=float32`. By default,
> OpenLLM will set model `dtype` to `bfloat16` for the best performance.
> ```bash
> DTYPE=float32 openllm start microsoft/phi-2
> ```
> This will also applies to older GPUs. If your GPUs doesn't support `bfloat16`, then you also
> want to set `DTYPE=float16`.
## 🧩 Supported models
OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.
@@ -1097,7 +1028,6 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .
> [!IMPORTANT]
> Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.
## ⚙️ Integrations
OpenLLM is not just a standalone product; it's a building block designed to
@@ -1115,11 +1045,9 @@ specify the base_url to `llm-endpoint/v1` and you are good to go:
```python
import openai
client = openai.OpenAI(
base_url='http://localhost:3000/v1', api_key='na'
) # Here the server is running on localhost:3000
client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na') # Here the server is running on 0.0.0.0:3000
completions = client.completions.create(
completions = client.chat.completions.create(
prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=stream
)
```
@@ -1130,7 +1058,6 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod
> You can find out OpenAI example clients under the
> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.
### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/)
To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`:
@@ -1172,24 +1099,6 @@ llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http')
llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
```
### Transformers Agents
OpenLLM seamlessly integrates with
[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).
> [!WARNING]
> The Transformers Agent is still at an experimental stage. It is
> recommended to install OpenLLM with `pip install -r nightly-requirements.txt`
> to get the latest API update for HuggingFace agent.
```python
import transformers
agent = transformers.HfAgent('http://localhost:3000/hf/agent') # URL that runs the OpenLLM server
agent.run('Is the following `text` positive or negative?', text="I don't like how this models is generate inputs")
```
<!-- hatch-fancy-pypi-readme interim stop -->
![Gif showing Agent integration](/.github/assets/agent.gif)
@@ -1280,26 +1189,6 @@ Checkout our
[Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md)
if you wish to contribute to OpenLLM's codebase.
## 🍇 Telemetry
OpenLLM collects usage data to enhance user experience and improve the product.
We only report OpenLLM's internal API calls and ensure maximum privacy by
excluding sensitive information. We will never collect user code, model data, or
stack traces. For usage tracking, check out the
[code](https://github.com/bentoml/OpenLLM/blob/main/openllm-core/src/openllm_core/utils/analytics.py).
You can opt out of usage tracking by using the `--do-not-track` CLI option:
```bash
openllm [command] --do-not-track
```
Or by setting the environment variable `OPENLLM_DO_NOT_TRACK=True`:
```bash
export OPENLLM_DO_NOT_TRACK=True
```
## 📔 Citation
If you use OpenLLM in your research, we provide a [citation](./CITATION.cff) to