docs: Update README.md (#964)

* Update README.md

Signed-off-by: Chaoyu <paranoyang@gmail.com>

* Update README.md

Co-authored-by: Sherlock Xu <65327072+Sherlock113@users.noreply.github.com>
Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>

---------

Signed-off-by: Chaoyu <paranoyang@gmail.com>
Signed-off-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Sherlock Xu <65327072+Sherlock113@users.noreply.github.com>
Author: Chaoyu
Date: 2024-04-24 13:23:24 -07:00
Committed by: GitHub
Parent: cb7a9a0e12
Commit: f9221879f0

README.md

@@ -3,61 +3,32 @@
<!-- hatch-fancy-pypi-readme intro start -->
<div align="center">
<h1 align="center">🦾 OpenLLM: Self-Hosting Large Language Models Made Easy</h1>
<h1 align="center">🦾 OpenLLM: Self-Hosting LLMs Made Easy</h1>
<a href="https://pypi.org/project/openllm">
<img src="https://img.shields.io/pypi/v/openllm.svg?logo=pypi&label=PyPI&logoColor=gold" alt="pypi_status" />
</a><a href="https://test.pypi.org/project/openllm/">
<img src="https://img.shields.io/badge/Nightly-PyPI?logo=pypi&label=PyPI&color=gray&link=https%3A%2F%2Ftest.pypi.org%2Fproject%2Fopenllm%2F" alt="test_pypi_status" />
</a><a href="https://twitter.com/bentomlai">
<img src="https://badgen.net/badge/icon/@bentomlai/1DA1F2?icon=twitter&label=Follow%20Us" alt="Twitter" />
</a><a href="https://l.bentoml.com/join-openllm-discord">
<img src="https://badgen.net/badge/icon/OpenLLM/7289da?icon=discord&label=Join%20Us" alt="Discord" />
</a><a href="https://github.com/bentoml/OpenLLM/actions/workflows/ci.yml">
<img src="https://github.com/bentoml/OpenLLM/actions/workflows/ci.yml/badge.svg?branch=main" alt="ci" />
</a><a href="https://results.pre-commit.ci/latest/github/bentoml/OpenLLM/main">
<img src="https://results.pre-commit.ci/badge/github/bentoml/OpenLLM/main.svg" alt="pre-commit.ci status" />
</a><br>
<a href="https://pypi.org/project/openllm">
<img src="https://img.shields.io/pypi/pyversions/openllm.svg?logo=python&label=Python&logoColor=gold" alt="python_version" />
</a><a href="https://github.com/pypa/hatch">
<img src="https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg" alt="Hatch" />
</a><a href="https://github.com/bentoml/OpenLLM/blob/main/STYLE.md">
<img src="https://img.shields.io/badge/code%20style-Google-000000.svg" alt="code style" />
</a><a href="https://github.com/astral-sh/ruff">
<img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v2.json" alt="Ruff" />
</a><a href="https://github.com/python/mypy">
<img src="https://img.shields.io/badge/types-mypy-blue.svg" alt="types - mypy" />
</a><a href="https://github.com/microsoft/pyright">
<img src="https://img.shields.io/badge/types-pyright-yellow.svg" alt="types - pyright" />
</a><br>
<p>Run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints, locally and in the cloud.</p>
<i></i>
</a><br><a href="https://twitter.com/bentomlai">
<img src="https://badgen.net/badge/icon/@bentomlai/1DA1F2?icon=twitter&label=Follow%20Us" alt="Twitter" />
</a><a href="https://l.bentoml.com/join-openllm-discord">
<img src="https://badgen.net/badge/icon/OpenLLM/7289da?icon=discord&label=Join%20Us" alt="Discord" />
</a>
</div>
## 📖 Introduction
OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy it in the cloud or on-premises, and build powerful AI applications.
OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment.
Key features include:
🚂 **State-of-the-art LLMs**: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder.
- 🚂 Support a wide range of open-source LLMs including LLMs fine-tuned with your own data
- ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs
- 🔥 State-of-the-art serving and inference performance
- 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com)
🔥 **Flexible APIs**: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice.
⛓️ **Freedom to build**: First-class support for LangChain, BentoML, LlamaIndex, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.
🎯 **Streamline deployment**: Automatically generate your LLM server Docker images or deploy as serverless endpoints via
[☁️ BentoCloud](https://l.bentoml.com/bento-cloud), which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness.
🤖️ **Bring your own LLM**: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance for specific tasks. A unified fine-tuning API for models (`LLM.tuning()`) is coming soon.
⚡ **Quantization**: Run inference with less computational and memory costs with quantization techniques such as [LLM.int8](https://arxiv.org/abs/2208.07339), [SpQR (int4)](https://arxiv.org/abs/2306.03078), [AWQ](https://arxiv.org/pdf/2306.00978.pdf), [GPTQ](https://arxiv.org/abs/2210.17323), and [SqueezeLLM](https://arxiv.org/pdf/2306.07629v2.pdf).
📡 **Streaming**: Supports token streaming through server-sent events (SSE). You can use the `/v1/generate_stream` endpoint for streaming responses from LLMs; a client sketch follows below.
🔄 **Continuous batching**: Supports continuous batching via [vLLM](https://github.com/vllm-project/vllm) for increased total throughput.
OpenLLM is designed for AI application developers working to build production-ready applications based on LLMs. It delivers a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring these models, simplifying the end-to-end deployment workflow for LLMs.
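To illustrate the streaming feature above, here is a minimal client sketch for the `/v1/generate_stream` SSE endpoint. The host, port, and request payload shape are assumptions for illustration; check your server's API reference for the exact schema.

```python
import requests

# Stream tokens from a local OpenLLM server via server-sent events (SSE).
# The payload shape ({'prompt': ...}) is an illustrative assumption.
with requests.post(
    'http://localhost:3000/v1/generate_stream',
    json={'prompt': 'What is the meaning of life?'},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # skip SSE keep-alive blank lines
            print(line.decode(), flush=True)  # each event arrives as a 'data: ...' line
```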
<!-- hatch-fancy-pypi-readme intro stop -->
@@ -70,6 +41,7 @@ OpenLLM is designed for AI application developers working to build production-re
## 💾 TL;DR
For starters, we provide two ways to quickly try out OpenLLM:
### Jupyter Notebooks
Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb).
@@ -93,6 +65,7 @@ docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b
## 🏃 Get started
The following provides instructions for how to get started with OpenLLM locally.
### Prerequisites
You have installed Python 3.8 (or later) and `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts.
@@ -133,7 +106,6 @@ Commands:
prune Remove all saved models (and optionally bentos) built with OpenLLM locally.
query Query a LLM interactively, from a terminal.
start Start a LLMServer for any supported LLM.
start-grpc Start a gRPC LLMServer for any supported LLM.
Extensions:
build-base-container Base image builder for BentoLLM.
@@ -1121,49 +1093,11 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .
> [!IMPORTANT]
> Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.
## 🐍 Python SDK
Each LLM can be instantiated with `openllm.LLM`:
```python
import openllm
llm = openllm.LLM('microsoft/phi-2')
```
The main inference API is the streaming `generate_iterator` method:
```python
async for generation in llm.generate_iterator('What is the meaning of life?'):
    print(generation.outputs[0].text)
```
> [!NOTE]
> `llm.generate_iterator` is an async generator in order to support continuous batching with the vLLM backend. With async endpoints, each prompt
> is correctly added to the request queue and processed by the vLLM backend.
There is also a _one-shot_ `generate` method:
```python
await llm.generate('What is the meaning of life?')
```
This method is convenient for one-shot generation, but it mainly serves as an example of how to use `llm.generate_iterator`, which it calls under the hood.
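For instance, a one-shot helper can be built on top of the streaming iterator. A minimal sketch, assuming each yielded generation carries an incremental text delta (as in the streaming example above); the helper name is hypothetical:

```python
async def generate_once(llm, prompt, **attrs):
    # Drain the stream and join the chunks into a single string.
    # Assumes each yield is an incremental delta, not the cumulative output.
    chunks = []
    async for generation in llm.generate_iterator(prompt, **attrs):
        chunks.append(generation.outputs[0].text)
    return ''.join(chunks)
```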
> [!IMPORTANT]
> If you need to call your code in a synchronous context, you can wrap the async function with `asyncio.run`:
>
> ```python
> import asyncio
> async def generate(prompt, **attrs):
>     return await llm.generate(prompt, **attrs)
>
> asyncio.run(generate("The meaning of life is", temperature=0.23))
> ```
## ⚙️ Integrations
OpenLLM is not just a standalone product; it's a building block designed to
integrate with other powerful tools easily. We currently offer integration with
[BentoML](https://github.com/bentoml/BentoML),
[OpenAI's Compatible Endpoints](https://platform.openai.com/docs/api-reference/completions/object),
[LlamaIndex](https://www.llamaindex.ai/),
[LangChain](https://github.com/hwchase17/langchain), and
@@ -1192,29 +1126,8 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod
> You can find OpenAI example clients under the
> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.
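For example, a minimal sketch using the official `openai` Python client against a local server; the base URL, port, and model name below are assumptions for illustration:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local OpenLLM server.
client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # api_key is unused locally

completion = client.chat.completions.create(
    model='microsoft/phi-2',  # use whatever /v1/models reports for your server
    messages=[{'role': 'user', 'content': 'What is the meaning of life?'}],
)
print(completion.choices[0].message.content)
```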
### BentoML
An OpenLLM `LLM` can be integrated as a
[Runner](https://docs.bentoml.com/en/latest/concepts/runner.html) in your
BentoML service. Simply call `await llm.generate` to generate text. Note that
`llm.generate` uses the `runner` under the hood:
```python
import bentoml
import openllm

llm = openllm.LLM('microsoft/phi-2')
svc = bentoml.Service(name='llm-phi-service', runners=[llm.runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    generation = await llm.generate(input_text)
    return generation.outputs[0].text
```
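Once the service is running (for example via `bentoml serve`), the `prompt` API defined above can be called over HTTP. A sketch, assuming the default port 3000:

```python
import requests

# POST plain text to the `prompt` endpoint defined in the service above.
resp = requests.post(
    'http://localhost:3000/prompt',
    data='What is the meaning of life?',
    headers={'content-type': 'text/plain'},
)
print(resp.text)
```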
### [LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/modules.html#openllm)
### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/)
To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`:
@@ -1244,26 +1157,9 @@ from llama_index.llms.openllm import OpenLLMAPI
> [!NOTE]
> All synchronous and asynchronous API from `llama_index.llms.LLM` are supported.
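A minimal local example might look like the following; the model ID is illustrative, and the exact constructor signature should be checked against the integration's documentation:

```python
from llama_index.llms.openllm import OpenLLM

# Instantiate a local LLM through the LlamaIndex integration
# (constructor shown positionally; verify against the integration docs).
llm = OpenLLM('microsoft/phi-2')
print(llm.complete('What is the meaning of life?'))
```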
### [LangChain](https://python.langchain.com/docs/ecosystem/integrations/openllm)
### [LangChain](https://python.langchain.com/docs/integrations/llms/openllm/)
To quickly start a local LLM with `langchain`, simply do the following:
```python
from langchain.llms import OpenLLM
llm = OpenLLM(model_name='llama', model_id='meta-llama/Llama-2-7b-hf')
llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
```
> [!IMPORTANT]
> By default, OpenLLM uses the `safetensors` format for saving models.
> If the model doesn't support safetensors, make sure to pass
> `serialisation="legacy"` to use the legacy PyTorch bin format.
`langchain.llms.OpenLLM` can also interact with a remote OpenLLM
server. Given an OpenLLM server deployed elsewhere, you can connect to
it by specifying its URL:
Spin up an OpenLLM server, and connect to it by specifying its URL:
```python
from langchain.llms import OpenLLM
@@ -1272,24 +1168,6 @@ llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http')
llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
```
To integrate a LangChain agent with BentoML, you can do the following:
```python
import bentoml
from bentoml.io import Text
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenLLM

llm = OpenLLM(model_id='google/flan-t5-large', embedded=False, serialisation='legacy')
tools = load_tools(['serpapi', 'llm-math'], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
svc = bentoml.Service('langchain-openllm', runners=[llm.runner])

@svc.api(input=Text(), output=Text())
def chat(input_text: str):
    return agent.run(input_text)
```
> [!NOTE]
> You can find more examples under the
> [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.
### Transformers Agents
OpenLLM seamlessly integrates with
@@ -1346,11 +1224,10 @@ There are several ways to deploy your LLMs:
### ☁️ BentoCloud
Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/bento-cloud/), the
serverless cloud for shipping and scaling AI applications.
Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/), the inference platform
for fast moving AI teams.
1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/cloud)
for early access
1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/)
2. **Log into your BentoCloud account:**