infra: prepare 0.5 releases (#996)

* chore: prepare for 0.5

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

* chore: update changelogs

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

* chore: fix to lowest python version supported

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

* chore: update scripts

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>

---------

Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
2026-06-11 09:59:20 -04:00 · 2024-05-23 12:50:01 -04:00
parent a410b9cfe8
commit 5e97329bcb
4 changed files with 22 additions and 237 deletions
--- a/openllm-python/README.md
+++ b/openllm-python/README.md
@@ -23,13 +23,11 @@

 OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment.

-
 - 🚂 Support a wide range of open-source LLMs including LLMs fine-tuned with your own data
 - ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs
 - 🔥 State-of-the-art serving and inference performance
 - 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com)

-
 <!-- hatch-fancy-pypi-readme intro stop -->

 ![Gif showing OpenLLM Intro](/.github/assets/output.gif)
@@ -46,29 +44,13 @@ For starter, we provide two ways to quickly try out OpenLLM:

 Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb).

-### Docker
-
-We provide a docker container that helps you start running OpenLLM:
-
-```bash
-docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt
-```
-
-> [!NOTE]
-> Given you have access to GPUs and have setup [nvidia-docker](https://github.com/NVIDIA/nvidia-container-toolkit),  you can additionally pass in `--gpus`
-> to use GPU for faster inference and optimization
->```bash
-> docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm
-> ```
-
-
 ## 🏃 Get started

 The following provides instructions for how to get started with OpenLLM locally.

 ### Prerequisites

-You have installed Python 3.8 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.
+You have installed Python 3.9 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.

 ### Install OpenLLM

@@ -82,65 +64,23 @@ To verify the installation, run:

 ```bash
 $ openllm -h
-
-Usage: openllm [OPTIONS] COMMAND [ARGS]...
-
-   ██████╗ ██████╗ ███████╗███╗   ██╗██╗     ██╗     ███╗   ███╗
-  ██╔═══██╗██╔══██╗██╔════╝████╗  ██║██║     ██║     ████╗ ████║
-  ██║   ██║██████╔╝█████╗  ██╔██╗ ██║██║     ██║     ██╔████╔██║
-  ██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║██║     ██║     ██║╚██╔╝██║
-  ╚██████╔╝██║     ███████╗██║ ╚████║███████╗███████╗██║ ╚═╝ ██║
-   ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝╚══════╝╚══════╝╚═╝     ╚═╝.
-
-  An open platform for operating large language models in production.
-  Fine-tune, serve, deploy, and monitor any LLMs with ease.
-
-Options:
-  -v, --version  Show the version and exit.
-  -h, --help     Show this message and exit.
-
-Commands:
-  build       Package a given models into a BentoLLM.
-  import      Setup LLM interactively.
-  models      List all supported models.
-  prune       Remove all saved models, (and optionally bentos) built with OpenLLM locally.
-  query       Query a LLM interactively, from a terminal.
-  start       Start a LLMServer for any supported LLM.
-
-Extensions:
-  build-base-container  Base image builder for BentoLLM.
-  dive-bentos           Dive into a BentoLLM.
-  get-containerfile     Return Containerfile of any given Bento.
-  get-prompt            Get the default prompt used by OpenLLM.
-  list-bentos           List available bentos built by OpenLLM.
-  list-models           This is equivalent to openllm models...
-  playground            OpenLLM Playground.
 ```

 ### Start a LLM server

-OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start a [phi-2](https://huggingface.co/microsoft/phi-2) server, run the following:
+OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start a [Llama 3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) server, run the following:

 ```bash
-TRUST_REMOTE_CODE=True openllm start microsoft/phi-2
+openllm start meta-llama/Meta-Llama-3-8B
 ```

-This starts the server at [http://0.0.0.0:3000/](http://0.0.0.0:3000/). OpenLLM downloads the model to the BentoML local Model Store if it has not been registered before. To view your local models, run `bentoml models list`.
-
 To interact with the server, you can visit the web UI at [http://0.0.0.0:3000/](http://0.0.0.0:3000/) or send a request using `curl`. You can also use OpenLLM’s built-in Python client to interact with the server:

 ```python
 import openllm

-client = openllm.client.HTTPClient('http://localhost:3000')
-client.query('Explain to me the difference between "further" and "farther"')
-```
-
-Alternatively, use the `openllm query` command to query the model:
-
-```bash
-export OPENLLM_ENDPOINT=http://localhost:3000
-openllm query 'Explain to me the difference between "further" and "farther"'
+client = openllm.HTTPClient('http://localhost:3000')
+client.generate('Explain to me the difference between "further" and "farther"')
 ```

 OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example:
@@ -155,15 +95,6 @@ openllm start <model_id> --<options>
 > architecture. Use the `openllm models` command to see the complete list of supported
 > models, their architectures, and their variants.

-> [!IMPORTANT]
-> If you are testing OpenLLM on CPU, you might want to pass in `DTYPE=float32`. By default,
-> OpenLLM will set model `dtype` to `bfloat16` for the best performance.
-> ```bash
-> DTYPE=float32 openllm start microsoft/phi-2
-> ```
-> This will also applies to older GPUs. If your GPUs doesn't support `bfloat16`, then you also
-> want to set `DTYPE=float16`.
-
 ## 🧩 Supported models

 OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.
@@ -1097,7 +1028,6 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .
 > [!IMPORTANT]
 > Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.

-
 ## ⚙️ Integrations

 OpenLLM is not just a standalone product; it's a building block designed to
@@ -1115,11 +1045,9 @@ specify the base_url to `llm-endpoint/v1` and you are good to go:
 ```python
 import openai

-client = openai.OpenAI(
-  base_url='http://localhost:3000/v1', api_key='na'
-)  # Here the server is running on localhost:3000
+client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # Here the server is running on 0.0.0.0:3000

-completions = client.completions.create(
+completions = client.chat.completions.create(
  prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=stream
 )
 ```
@@ -1130,7 +1058,6 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod
 > You can find out OpenAI example clients under the
 > [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.

-
 ### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/)

 To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`:
@@ -1172,24 +1099,6 @@ llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http')
 llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
 ```

-### Transformers Agents
-
-OpenLLM seamlessly integrates with
-[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).
-
-> [!WARNING]
-> The Transformers Agent is still at an experimental stage. It is
-> recommended to install OpenLLM with `pip install -r nightly-requirements.txt`
-> to get the latest API update for HuggingFace agent.
-
-```python
-import transformers
-
-agent = transformers.HfAgent('http://localhost:3000/hf/agent')  # URL that runs the OpenLLM server
-
-agent.run('Is the following `text` positive or negative?', text="I don't like how this models is generate inputs")
-```
-
 <!-- hatch-fancy-pypi-readme interim stop -->

 ![Gif showing Agent integration](/.github/assets/agent.gif)
@@ -1280,26 +1189,6 @@ Checkout our
 [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md)
 if you wish to contribute to OpenLLM's codebase.

-## 🍇 Telemetry
-
-OpenLLM collects usage data to enhance user experience and improve the product.
-We only report OpenLLM's internal API calls and ensure maximum privacy by
-excluding sensitive information. We will never collect user code, model data, or
-stack traces. For usage tracking, check out the
-[code](https://github.com/bentoml/OpenLLM/blob/main/openllm-core/src/openllm_core/utils/analytics.py).
-
-You can opt out of usage tracking by using the `--do-not-track` CLI option:
-
-```bash
-openllm [command] --do-not-track
-```
-
-Or by setting the environment variable `OPENLLM_DO_NOT_TRACK=True`:
-
-```bash
-export OPENLLM_DO_NOT_TRACK=True
-```
-
 ## 📔 Citation

 If you use OpenLLM in your research, we provide a [citation](./CITATION.cff) to