diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 70f7d855..b8849069 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -20,11 +20,6 @@ on:
       - "main"
     tags:
       - "v*"
-    paths:
-      - ".github/workflows/build.yaml"
-      - "src/openllm/bundle/oci/Dockerfile"
-      - "src/openllm/**"
-      - "src/openllm_client/**"
   pull_request:
     branches:
       - "main"
diff --git a/CHANGELOG.md b/CHANGELOG.md
index c03e905d..abb733db 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -18,6 +18,55 @@
 This changelog is managed by towncrier and is compiled at release time.
 
+## Changes for the Upcoming Release
+
+> **Warning**: These changes reflect the current [development progress](https://github.com/bentoml/openllm/tree/main)
+> and have **not** been part of an official PyPI release yet.
+> To try out the latest changes, one can do: `pip install -U git+https://github.com/bentoml/openllm.git@main`
+
+
+### Features
+
+- Added support for a base container for OpenLLM. The base container contains all the necessary requirements
+  to run OpenLLM. Currently it includes compiled versions of FlashAttention v2, vLLM, AutoGPTQ and Triton.
+
+  This will now be the base image for all future BentoLLMs. The image will also be published to the public GHCR.
+
+  To extend and use this image in your Bento, simply specify ``base_image`` under ``bentofile.yaml``:
+
+  ```yaml
+  docker:
+    base_image: ghcr.io/bentoml/openllm:<tag>
+  ```
+
+  The release strategy will include:
+  - versioning of ``ghcr.io/bentoml/openllm:sha-<commit>`` for every commit to main, and ``ghcr.io/bentoml/openllm:0.2.11`` for a specific release version
+  - the alias ``latest``, managed with docker/build-push-action (discouraged)
+
+  Note that all of these images include compiled kernels that have been tested on Ampere GPUs with CUDA 11.8.
+
+  To quickly run the image, do the following:
+
+  ```bash
+  docker run --rm --gpus all -it -v /home/ubuntu/.local/share/bentoml:/tmp/bentoml -e BENTOML_HOME=/tmp/bentoml \
+    -e OPENLLM_USE_LOCAL_LATEST=True -e OPENLLM_LLAMA_FRAMEWORK=vllm ghcr.io/bentoml/openllm:2b5e96f90ad314f54e07b5b31e386e7d688d9bb2 start llama --model-id meta-llama/Llama-2-7b-chat-hf --workers-per-resource conserved --debug
+  ```
+
+  In conjunction with this, OpenLLM now also has a set of small CLI utilities via ``openllm ext`` for ease of use.
+
+  General fixes around the codebase for bytecode optimization.
+
+  Fixed log output to filter the correct level based on ``--debug`` and ``--quiet``.
+
+  ``openllm build`` will now run the model check locally by default. To skip it, pass ``--fast`` (previously this was the default behaviour, but ``--no-fast`` as the default makes more sense here, since ``openllm build`` should also be able to run standalone).
+
+  All ``LlaMA`` namespaces have been renamed to ``Llama`` (an internal change that shouldn't affect end users).
+
+  ``openllm.AutoModel.for_model`` will now always return the model instance; runner kwargs will be handled via ``create_runner``.
+  [#142](https://github.com/bentoml/openllm/issues/142)
+
+
 ## [0.2.11](https://github.com/bentoml/openllm/tree/v0.2.11)
 
 ### Features
diff --git a/README.md b/README.md
index b5531b68..3c662a03 100644
--- a/README.md
+++ b/README.md
@@ -6,20 +6,25 @@

🦾 OpenLLM

[badge markup not recoverable from this excerpt: the hunk reflows the README header badge links and adds CI, pre-commit.ci, mypy, and pyright badges alongside the existing pypi_status, python_version, and Hatch badges]

An open platform for operating large language models (LLMs) in production.
Fine-tune, serve, deploy, and monitor any LLMs with ease.

@@ -120,17 +125,19 @@ openllm query 'Explain to me the difference between "further" and "farther"'
 
 Visit `http://localhost:3000/docs.json` for OpenLLM's API specification.
 
-OpenLLM seamlessly supports many models and their variants.
-Users can also specify different variants of the model to be served, by
-providing the `--model-id` argument, e.g.:
+OpenLLM seamlessly supports many models and their variants. Users can also
+specify different variants of the model to be served, by providing the
+`--model-id` argument, e.g.:
 
 ```bash
 openllm start flan-t5 --model-id google/flan-t5-large
 ```
 
-> **Note** that `openllm` also supports all variants of fine-tuning weights, custom model path
-> as well as quantized weights for any of the supported models as long as it can be loaded with
-> the model architecture. Refer to [supported models](https://github.com/bentoml/OpenLLM/tree/main#-supported-models) section for models' architecture.
+> **Note** that `openllm` also supports all variants of fine-tuning weights,
+> custom model paths as well as quantized weights for any of the supported models
+> as long as they can be loaded with the model architecture. Refer to the
+> [supported models](https://github.com/bentoml/OpenLLM/tree/main#-supported-models)
+> section for each model's architecture.
 
 Use the `openllm models` command to see the list of models and their variants
 supported in OpenLLM.
@@ -473,8 +480,8 @@ To include this into the Bento, one can also provide a `--adapter-id` into
 openllm build opt --model-id facebook/opt-6.7b --adapter-id ...
 ```
 
-> **Note**: We will gradually roll out support for fine-tuning all models.
-> The following models contain fine-tuning support: OPT, Falcon, LlaMA.
+> **Note**: We will gradually roll out support for fine-tuning all models. The
+> following models contain fine-tuning support: OPT, Falcon, LlaMA.
 
 ### Integrating a New Model
 
@@ -485,10 +492,10 @@ to see how you can do it yourself.
 
 ### Embeddings
 
-OpenLLM tentatively provides embeddings endpoint for supported models.
-This can be accessed via `/v1/embeddings`.
+OpenLLM tentatively provides an embeddings endpoint for supported models. This
+can be accessed via `/v1/embeddings`.
 
-To use via CLI, simply call ``openllm embed``:
+To use it via the CLI, simply call `openllm embed`:
 
 ```bash
 openllm embed --endpoint http://localhost:3000 "I like to eat apples" -o json
@@ -508,7 +515,7 @@ openllm embed --endpoint http://localhost:3000 "I like to eat apples" -o json
 }
 ```
 
-To invoke this endpoint, use ``client.embed`` from the Python SDK:
+To invoke this endpoint, use `client.embed` from the Python SDK:
 
 ```python
 import openllm
@@ -518,15 +525,16 @@ client = openllm.client.HTTPClient("http://localhost:3000")
 client.embed("I like to eat apples")
 ```
 
-> **Note**: Currently, the following model framily supports embeddings: Llama, T5 (Flan-T5, FastChat, etc.), ChatGLM
+> **Note**: Currently, the following model families support embeddings: Llama,
+> T5 (Flan-T5, FastChat, etc.), ChatGLM
 
 ## ⚙️ Integrations
 
 OpenLLM is not just a standalone product; it's a building block designed to
 integrate with other powerful tools easily. We currently offer integration with
 [BentoML](https://github.com/bentoml/BentoML),
-[LangChain](https://github.com/hwchase17/langchain),
-and [Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).
+[LangChain](https://github.com/hwchase17/langchain), and
+[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).
 
 ### BentoML
@@ -555,7 +563,6 @@ async def prompt(input_text: str) -> str:
     return answer
 ```
 
-
 ### [LangChain](https://python.langchain.com/docs/ecosystem/integrations/openllm)
 
 To quickly start a local LLM with `langchain`, simply do the following:
@@ -600,15 +607,14 @@ def chat(input_text: str):
 > **Note** You can find out more examples under the
 > [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder.
 
-
 ### Transformers Agents
 
-OpenLLM seamlessly integrates with [Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).
-
+OpenLLM seamlessly integrates with
+[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents).
 
 > **Warning** The Transformers Agent is still at an experimental stage. It is
-> recommended to install OpenLLM with `pip install -r nightly-requirements.txt` to get
-> the latest API update for HuggingFace agent.
+> recommended to install OpenLLM with `pip install -r nightly-requirements.txt`
+> to get the latest API update for the HuggingFace agent.
 
 ```python
 import transformers
@@ -665,15 +671,14 @@ There are several ways to deploy your LLMs:
    ```bash
    bentoml containerize <name:version>
   ```
-   This generates a OCI-compatible docker image that can be deployed anywhere docker runs.
-   For best scalability and reliability of your LLM service in production, we recommend deploy
-   with BentoCloud。
-
+   This generates an OCI-compatible docker image that can be deployed anywhere
+   docker runs. For best scalability and reliability of your LLM service in
+   production, we recommend deploying with BentoCloud.
 
 ### ☁️ BentoCloud
 
-Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/bento-cloud/),
-the serverless cloud for shipping and scaling AI applications.
+Deploy OpenLLM with [BentoCloud](https://www.bentoml.com/bento-cloud/), the
+serverless cloud for shipping and scaling AI applications.
 
 1. **Create a BentoCloud account:** [sign up here](https://bentoml.com/cloud)
    for early access
@@ -705,7 +710,6 @@ the serverless cloud for shipping and scaling AI applications.
   `bentoml deployment create` command following the
   [deployment instructions](https://docs.bentoml.com/en/latest/reference/cli.html#bentoml-deployment-create).
 
-
 ## 👥 Community
 
 Engage with like-minded individuals passionate about LLMs, AI, and more on our
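The changelog entry in this diff flips the default of `openllm build`: the model check now runs locally unless `--fast` is passed. A minimal sketch of the two invocations, reusing the `opt` model id that appears elsewhere in this README (the model id is purely illustrative):

```bash
# New default per the changelog above: the model is checked locally before the Bento is built.
openllm build opt --model-id facebook/opt-6.7b

# Skip the local model check (the previous default behaviour).
openllm build opt --model-id facebook/opt-6.7b --fast
```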
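The changelog also mentions a set of small helper utilities under `openllm ext` without listing them. Assuming the group follows the same `--help` convention as the rest of the CLI (the subcommand names are not part of this diff, so none are shown here), they can be discovered with:

```bash
# Hypothetical discovery step: list the `openllm ext` utilities added in this release.
openllm ext --help
```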
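For the containerization path described in the deployment section above, here is a hedged end-to-end sketch; `opt-service:latest` is a stand-in for whatever tag `openllm build` actually prints, and the final step assumes the image's default entrypoint serves the Bento on port 3000:

```bash
# Build the Bento for a supported model (model id as used earlier in this README).
openllm build opt --model-id facebook/opt-6.7b

# Containerize it; replace the tag with the one printed by the build step.
bentoml containerize opt-service:latest

# Run the resulting OCI image; 3000 is the default BentoML service port.
docker run --rm --gpus all -p 3000:3000 opt-service:latest
```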