# Changelog

We are following [semantic versioning](https://semver.org/) with a strict
backwards-compatibility policy.

You can find our backwards-compatibility policy
[here](https://github.com/bentoml/openllm/blob/main/.github/SECURITY.md).

Changes for the upcoming release can be found in the
['changelog.d' directory](https://github.com/bentoml/openllm/tree/main/changelog.d)
in our repository.

<!--
Do *NOT* add changelog entries here!

This changelog is managed by towncrier and is compiled at release time.
-->

<!-- towncrier release notes start -->

## [0.4.40](https://github.com/bentoml/openllm/tree/v0.4.40)

No significant changes.

## [0.4.39](https://github.com/bentoml/openllm/tree/v0.4.39)

### Bug fix

- Fixes the logprobs branch for the PyTorch backend.
  [#779](https://github.com/bentoml/openllm/issues/779)

## [0.4.38](https://github.com/bentoml/openllm/tree/v0.4.38)

### Changes

- Updated the arguments of `openllm import` and `openllm build` to be consistent with `openllm start`.
  [#775](https://github.com/bentoml/openllm/issues/775)

### Features

- Mixtral is now fully supported on BentoCloud.

  ```bash
  openllm start mistralai/Mixtral-8x7B-Instruct-v0.1
  ```

  [#774](https://github.com/bentoml/openllm/issues/774)

## [0.4.37](https://github.com/bentoml/openllm/tree/v0.4.37)

No significant changes.

## [0.4.36](https://github.com/bentoml/openllm/tree/v0.4.36)

No significant changes.

## [0.4.35](https://github.com/bentoml/openllm/tree/v0.4.35)

No significant changes.

## [0.4.34](https://github.com/bentoml/openllm/tree/v0.4.34)

No significant changes.

## [0.4.33](https://github.com/bentoml/openllm/tree/v0.4.33)

No significant changes.

## [0.4.32](https://github.com/bentoml/openllm/tree/v0.4.32)

No significant changes.

## [0.4.31](https://github.com/bentoml/openllm/tree/v0.4.31)

No significant changes.

## [0.4.30](https://github.com/bentoml/openllm/tree/v0.4.30)

No significant changes.

## [0.4.29](https://github.com/bentoml/openllm/tree/v0.4.29)

No significant changes.

## [0.4.28](https://github.com/bentoml/openllm/tree/v0.4.28)

### Changes

- Only baichuan2 and baichuan3 are supported; support for baichuan 1 has been dropped.
  [#728](https://github.com/bentoml/openllm/issues/728)

## [0.4.27](https://github.com/bentoml/openllm/tree/v0.4.27)

### Changes

- We will deprecate support for the PyTorch backend and enforce that all
  built Bentos use the vLLM backend going forward. This means that `openllm build`
  with `--backend pt` is now deprecated in favour of `--backend vllm`.

  We will focus more on contributing upstream to vLLM and ensure that the core
  value of OpenLLM remains providing a flexible and streamlined experience for bringing these
  models to production with ease.

  The PyTorch backend will be removed from the 0.5.0 releases onwards.

  The docker images will now only be available on GHCR and no longer on ECR, as a measure
  to reduce cost and maintenance on our side.
  [#730](https://github.com/bentoml/openllm/issues/730)

## [0.4.26](https://github.com/bentoml/openllm/tree/v0.4.26)

### Features

- `/v1/chat/completions` now accepts two additional parameters (a request sketch follows below):

  - `chat_templates`: a string containing a [Jinja template](https://huggingface.co/docs/transformers/main/chat_templating#templates-for-chat-models)
    to use with this model. By default, the model's own chat template (based on config.json) is used.
  - `add_generation_prompt`: see [here](https://huggingface.co/docs/transformers/main/chat_templating#how-do-i-use-chat-templates)

  [#725](https://github.com/bentoml/openllm/issues/725)

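A minimal request sketch for these parameters, assuming an OpenAI-style JSON body; the surrounding fields (`model`, `messages`), the port, and the inline template string are illustrative assumptions rather than part of this entry:

```bash
# Hedged sketch: only `chat_templates` and `add_generation_prompt` come from this entry.
curl -s http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "messages": [{"role": "user", "content": "Hi there!"}],
        "chat_templates": "{% for message in messages %}{{ message.content }}{% endfor %}",
        "add_generation_prompt": true
      }'
```
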
## [0.4.25](https://github.com/bentoml/openllm/tree/v0.4.25)

No significant changes.

## [0.4.24](https://github.com/bentoml/openllm/tree/v0.4.24)

No significant changes.

## [0.4.23](https://github.com/bentoml/openllm/tree/v0.4.23)

No significant changes.

## [0.4.22](https://github.com/bentoml/openllm/tree/v0.4.22)

No significant changes.

## [0.4.21](https://github.com/bentoml/openllm/tree/v0.4.21)

No significant changes.

## [0.4.20](https://github.com/bentoml/openllm/tree/v0.4.20)

No significant changes.

## [0.4.19](https://github.com/bentoml/openllm/tree/v0.4.19)

No significant changes.

## [0.4.18](https://github.com/bentoml/openllm/tree/v0.4.18)

No significant changes.

## [0.4.17](https://github.com/bentoml/openllm/tree/v0.4.17)

No significant changes.

## [0.4.16](https://github.com/bentoml/openllm/tree/v0.4.16)

### Changes

- Updated vLLM to 0.2.2, bringing more model support and a lot of improvements from upstream.
  [#695](https://github.com/bentoml/openllm/issues/695)

### Features

- Added an experimental CTranslate backend to run on CPU, which yields higher TPS compared to the PyTorch counterpart.

  This has been tested on c5.4xlarge instances.
  [#698](https://github.com/bentoml/openllm/issues/698)

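A start-command sketch for trying the experimental backend on CPU; the `--backend ctranslate` value and the model ID are assumptions based on the backend-selection flag used elsewhere in this changelog, not confirmed by this entry:

```bash
# Hedged sketch: flag value and model ID are assumptions.
openllm start mistralai/Mistral-7B-Instruct-v0.1 --backend ctranslate
```
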
## [0.4.15](https://github.com/bentoml/openllm/tree/v0.4.15)

### Features

- PyTorch runners now support logprobs calculation for the `logits` output.

  Updated the logits calculation to support encoder-decoder models (which fixes T5 inference).
  [#692](https://github.com/bentoml/openllm/issues/692)

## [0.4.14](https://github.com/bentoml/openllm/tree/v0.4.14)

No significant changes.

## [0.4.13](https://github.com/bentoml/openllm/tree/v0.4.13)

No significant changes.

## [0.4.12](https://github.com/bentoml/openllm/tree/v0.4.12)

No significant changes.

## [0.4.11](https://github.com/bentoml/openllm/tree/v0.4.11)

### Bug fix

- Fixes an environment-generation bug that caused the CONFIG envvar to be invalid JSON.
  [#680](https://github.com/bentoml/openllm/issues/680)

## [0.4.10](https://github.com/bentoml/openllm/tree/v0.4.10)

### Changes

- `openllm build` from 0.4.10 onwards will start locking packages for hermeticity.

  We also removed some packages that are not required, since they should already be in the base image.

  Improved general codegen for service vars to statically save all variables in `_service_vars.py`, saving two environment-variable access calls.
  The environment variables for all variables are still present in the container for backwards compatibility.
  [#669](https://github.com/bentoml/openllm/issues/669)

### Features

- Type hints for all exposed APIs are now provided through stubs. This means REPLs
  and static analysis tools like mypy can infer types from the library instantly, without
  having to infer types from runtime function signatures.
  [#663](https://github.com/bentoml/openllm/issues/663)
- OpenLLM images have now been compressed, reducing the size to around 6.75 GB uncompressed.
  [#675](https://github.com/bentoml/openllm/issues/675)

## [0.4.9](https://github.com/bentoml/openllm/tree/v0.4.9)

No significant changes.

## [0.4.8](https://github.com/bentoml/openllm/tree/v0.4.8)

No significant changes.

## [0.4.7](https://github.com/bentoml/openllm/tree/v0.4.7)

No significant changes.

## [0.4.6](https://github.com/bentoml/openllm/tree/v0.4.6)

No significant changes.

## [0.4.5](https://github.com/bentoml/openllm/tree/v0.4.5)

No significant changes.

## [0.4.4](https://github.com/bentoml/openllm/tree/v0.4.4)

### Features

- Certain warnings can now be disabled with `OPENLLM_DISABLE_WARNINGS=True` in the environment.

  `openllm.LLM` now also brings an `embedded` mode. By default this is True. If `embedded=True`, then
  the model will be loaded eagerly. This should only be used during development.

  ```python
  import openllm

  llm = openllm.LLM('HuggingFaceH4/zephyr-7b-beta', backend='vllm', embedded=True)
  ```

  The default behaviour of loading the model the first time `llm.generate` or `llm.generate_iterator` is called is unchanged.
  The `embedded` option exists mainly for backwards compatibility and a more explicit definition.
  [#618](https://github.com/bentoml/openllm/issues/618)

## [0.4.3](https://github.com/bentoml/openllm/tree/v0.4.3)

### Features

- The OpenLLM server now provides a helpers endpoint to easily create prompts, with more utilities to come.

  `/v1/helpers/messages` will format a list of messages into the correct chat prompt for the given chat model (a curl sketch of the raw endpoint follows after this list).
  [#613](https://github.com/bentoml/openllm/issues/613)
- The client now has an additional `helpers` attribute to work with the helpers endpoint:

  ```python
  import openllm

  client = openllm.HTTPClient()

  prompt = client.helpers.messages(
    add_generation_prompt=False,
    messages=[
      {'role': 'system', 'content': 'You are acting as Ernest Hemmingway.'},
      {'role': 'user', 'content': 'Hi there!'},
      {'role': 'assistant', 'content': 'Yes?'},
    ],
  )
  ```

  Async variant:

  ```python
  import openllm

  client = openllm.AsyncHTTPClient()

  prompt = await client.helpers.messages(
    add_generation_prompt=False,
    messages=[
      {'role': 'system', 'content': 'You are acting as Ernest Hemmingway.'},
      {'role': 'user', 'content': 'Hi there!'},
      {'role': 'assistant', 'content': 'Yes?'},
    ],
  )
  ```

  [#615](https://github.com/bentoml/openllm/issues/615)

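The curl sketch referenced above, hitting the raw `/v1/helpers/messages` endpoint; the JSON body mirrors the client call and its exact field names are assumptions:

```bash
# Hedged sketch: request-body shape is inferred from the client example above.
curl -s http://localhost:3000/v1/helpers/messages \
  -H 'Content-Type: application/json' \
  -d '{
        "add_generation_prompt": false,
        "messages": [
          {"role": "system", "content": "You are acting as Ernest Hemmingway."},
          {"role": "user", "content": "Hi there!"}
        ]
      }'
```
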
## [0.4.2](https://github.com/bentoml/openllm/tree/v0.4.2)

### Changes

- Updated the client implementation and added support for authentication through `OPENLLM_AUTH_TOKEN`.
  [#605](https://github.com/bentoml/openllm/issues/605)

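A small usage sketch, assuming the HTTP clients pick the token up from the environment; the token value is a placeholder:

```bash
# Hedged sketch: placeholder token; openllm.HTTPClient/openllm.AsyncHTTPClient are assumed to read it from the environment.
export OPENLLM_AUTH_TOKEN=<your-token>
```
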
### Refactor

- ## Auto backend detection

  By default, OpenLLM will use vLLM (if available) to run the server. We recommend users to always explicitly set the backend with `--backend vllm` for the best performance.

  If vLLM is not available, OpenLLM will fall back to the PyTorch backend. Note that the PyTorch backend won't be as performant.

  ## Revamped CLI interface

  This is part of the recent restructure of `openllm.LLM`.

  For all CLI commands, there is no need to pass in the architecture anymore. One can directly pass in the model and save a few characters.

  Start:

  ```bash
  openllm start meta-llama/Llama-2-13b-chat-hf --device 0
  ```

  Build:

  ```bash
  openllm build meta-llama/Llama-2-13b-chat-hf --serialisation safetensors
  ```

  Import:

  ```bash
  openllm import mistralai/Mistral-7B-v0.1 --serialisation legacy
  ```

  All CLI outputs will now dump JSON objects to stdout, which ensures easier programmatic access to the CLI.
  This means `--output/-o` is removed from all CLI commands, as all of them will output JSON.

  Passing in `model_name` is now deprecated and will be removed in the future. If you try `openllm start opt`, you will see the following:

  ```bash
  $ openllm start opt

  Passing 'openllm start opt' is deprecated and will be remove in a future version. Use 'openllm start facebook/opt-1.3b' instead.
  ```

  Example output of `openllm models`:

  ```bash
  $ openllm models

  {
    "chatglm": {
      "architecture": "ChatGLMModel",
      "example_id": "thudm/chatglm2-6b",
      "supported_backends": ["pt"],
      "installation": "pip install \"openllm[chatglm]\"",
      "items": []
    },
    "dolly_v2": {
      "architecture": "GPTNeoXForCausalLM",
      "example_id": "databricks/dolly-v2-3b",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install openllm",
      "items": []
    },
    "falcon": {
      "architecture": "FalconForCausalLM",
      "example_id": "tiiuae/falcon-40b-instruct",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install \"openllm[falcon]\"",
      "items": []
    },
    "flan_t5": {
      "architecture": "T5ForConditionalGeneration",
      "example_id": "google/flan-t5-small",
      "supported_backends": ["pt"],
      "installation": "pip install openllm",
      "items": []
    },
    "gpt_neox": {
      "architecture": "GPTNeoXForCausalLM",
      "example_id": "eleutherai/gpt-neox-20b",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install openllm",
      "items": []
    },
    "llama": {
      "architecture": "LlamaForCausalLM",
      "example_id": "NousResearch/llama-2-70b-hf",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install \"openllm[llama]\"",
      "items": []
    },
    "mpt": {
      "architecture": "MPTForCausalLM",
      "example_id": "mosaicml/mpt-7b-chat",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install \"openllm[mpt]\"",
      "items": []
    },
    "opt": {
      "architecture": "OPTForCausalLM",
      "example_id": "facebook/opt-2.7b",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install \"openllm[opt]\"",
      "items": []
    },
    "stablelm": {
      "architecture": "GPTNeoXForCausalLM",
      "example_id": "stabilityai/stablelm-base-alpha-3b",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install openllm",
      "items": []
    },
    "starcoder": {
      "architecture": "GPTBigCodeForCausalLM",
      "example_id": "bigcode/starcoder",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install \"openllm[starcoder]\"",
      "items": []
    },
    "mistral": {
      "architecture": "MistralForCausalLM",
      "example_id": "amazon/MistralLite",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install openllm",
      "items": []
    },
    "baichuan": {
      "architecture": "BaiChuanForCausalLM",
      "example_id": "fireballoon/baichuan-vicuna-chinese-7b",
      "supported_backends": ["pt", "vllm"],
      "installation": "pip install \"openllm[baichuan]\"",
      "items": []
    }
  }
  ```

  [#592](https://github.com/bentoml/openllm/issues/592)

## [0.4.1](https://github.com/bentoml/openllm/tree/v0.4.1)

No significant changes.

## [0.4.0](https://github.com/bentoml/openllm/tree/v0.4.0)

No significant changes.

## [0.3.14](https://github.com/bentoml/openllm/tree/v0.3.14)

No significant changes.

## [0.3.13](https://github.com/bentoml/openllm/tree/v0.3.13)

No significant changes.

## [0.3.12](https://github.com/bentoml/openllm/tree/v0.3.12)

No significant changes.

## [0.3.10](https://github.com/bentoml/openllm/tree/v0.3.10)

No significant changes.

## [0.3.9](https://github.com/bentoml/openllm/tree/v0.3.9)

No significant changes.

## [0.3.8](https://github.com/bentoml/openllm/tree/v0.3.8)

### Backwards-incompatible Changes

- Removed the embeddings endpoints from the provided API, as it is probably not a good fit to have them here yet.

  This means that `openllm embed` will also be removed.

  The client implementation is also updated to fix the 0.3.7 breaking changes with models other than Llama.
  [#500](https://github.com/bentoml/openllm/issues/500)

### Features

- Add a `/v1/models` endpoint for the OpenAI-compatible API.
  [#499](https://github.com/bentoml/openllm/issues/499)

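A quick sketch of querying the new endpoint; the port depends on your deployment:

```bash
# Lists the models served by the running OpenAI-compatible server.
curl -s http://localhost:3000/v1/models
```
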
## [0.3.7](https://github.com/bentoml/openllm/tree/v0.3.7)

No significant changes.

## [0.3.6](https://github.com/bentoml/openllm/tree/v0.3.6)

### Features

- Added support for continuous batching on `/v1/generate`.
  [#375](https://github.com/bentoml/openllm/issues/375)

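A request sketch against the batched endpoint; the body shape (`prompt` plus generation options under `llm_config`) is an assumption, only the `/v1/generate` path comes from this entry:

```bash
# Hedged sketch: field names are assumptions.
curl -s http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "What is continuous batching?", "llm_config": {"max_new_tokens": 128}}'
```
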
## [0.3.5](https://github.com/bentoml/openllm/tree/v0.3.5)

### Features

- Added support for continuous batching via vLLM.

  Current benchmarks show that 100 concurrent requests yield around 1218 TPS on one A100 running meta-llama/Llama-2-13b-chat-hf.
  [#349](https://github.com/bentoml/openllm/issues/349)

### Bug fix

- Set a default serialisation format for all models.

  Currently, only Llama 2 uses safetensors as the default format. For all other models that have a safetensors variant available, it can be opted in via `--serialisation safetensors`.
  [#355](https://github.com/bentoml/openllm/issues/355)

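An opt-in sketch using the 0.3.x-style CLI; the model name and model ID are illustrative, only the `--serialisation safetensors` flag comes from this entry:

```bash
# Hedged sketch: model name/ID are illustrative.
openllm start opt --model-id facebook/opt-1.3b --serialisation safetensors
```
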
## [0.3.4](https://github.com/bentoml/openllm/tree/v0.3.4)

### Bug fix

- vLLM should now support the safetensors loading format, so `--serialisation` should now be agnostic of the backend.

  Removed some legacy checks and default behaviour.
  [#324](https://github.com/bentoml/openllm/issues/324)

## [0.3.3](https://github.com/bentoml/openllm/tree/v0.3.3)

No significant changes.

## [0.3.2](https://github.com/bentoml/openllm/tree/v0.3.2)

No significant changes.

## [0.3.1](https://github.com/bentoml/openllm/tree/v0.3.1)

### Changes

- Reverted back to only releasing pure wheels.

  Compiled wheels are disabled for now, until we move to a different implementation.
  [#304](https://github.com/bentoml/openllm/issues/304)

## [0.3.0](https://github.com/bentoml/openllm/tree/v0.3.0)

### Backwards-incompatible Changes

- All environment variables are now simplified, without the need for a model-specific prefix.

  For example: `OPENLLM_LLAMA_GENERATION_MAX_NEW_TOKENS` now becomes `OPENLLM_GENERATION_MAX_NEW_TOKENS`.

  Unified some miscellaneous environment variables. To switch between backends, one can use `--backend` for both `start` and `build`:

  ```bash
  openllm start llama --backend vllm
  ```

  or the environment variable `OPENLLM_BACKEND`:

  ```bash
  OPENLLM_BACKEND=vllm openllm start llama
  ```

  `openllm.Runner` will now by default try to download the model the first time if it is not available, and it will subsequently be cached in the model store.

  Model serialisation has been updated to a new API version with clearer naming. We kindly ask users to run `openllm prune -y --include-bentos` and update to
  this current version of openllm.
  [#283](https://github.com/bentoml/openllm/issues/283)

### Refactor

- Refactored GPTQ to use the official implementation from transformers>=4.32.
  [#297](https://github.com/bentoml/openllm/issues/297)

### Features

- Added support for vLLM streaming.

  This can now be accessed via `/v1/generate_stream`.
  [#260](https://github.com/bentoml/openllm/issues/260)

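A streaming sketch using SSE; `-N` disables curl's buffering, and the JSON body shape is an assumption, only the `/v1/generate_stream` path comes from this entry:

```bash
# Hedged sketch: body fields are assumptions.
curl -N -s http://localhost:3000/v1/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Write a haiku about GPUs.", "llm_config": {"max_new_tokens": 64}}'
```
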
## [0.2.27](https://github.com/bentoml/openllm/tree/v0.2.27)

### Changes

- Defined a specific style guideline for the project. See
  [STYLE.md](https://github.com/bentoml/OpenLLM/blob/main/STYLE.md) for more
  information.
  [#168](https://github.com/bentoml/openllm/issues/168)

### Refactor

- Exposed all extensions via `openllm extension`.

  Added a separate section for all extensions within the CLI. `openllm playground` is now considered an extension.

  Introduced compiled wheels gradually.

  Added an easy `cz.py` for code golf and LOC.
  [#191](https://github.com/bentoml/openllm/issues/191)
- Refactored openllm_js to openllm-node for initial node library development.
  [#199](https://github.com/bentoml/openllm/issues/199)
- OpenLLM now comprises three packages:

  - `openllm-core`: the main building blocks of OpenLLM, which do not depend on transformers and heavy DL libraries
  - `openllm-client`: the implementation of `openllm.client`
  - `openllm`: `openllm-core` + `openllm-client` + DL features (under `openllm-python`)

  OpenLLM now provides `start-grpc` as an opt-in. If you want to use `openllm start-grpc`, make sure to install
  with `pip install "openllm[grpc]"`.
  [#249](https://github.com/bentoml/openllm/issues/249)

### Features

- OpenLLM now provides SSE support.

  > [!NOTE]
  > For this to work, you must install BentoML>=1.1.2:
  > `pip install -U bentoml>=1.1.2`

  The endpoint can be accessed via `/v1/generate_stream`.

  > [!NOTE]
  > curl does in fact support SSE by passing in `-N`
  [#240](https://github.com/bentoml/openllm/issues/240)

## [0.2.26](https://github.com/bentoml/openllm/tree/v0.2.26)

### Features

- Added a generic embedding implementation, largely based on https://github.com/bentoml/sentence-embedding-bento,
  for all unsupported models.
  [#227](https://github.com/bentoml/openllm/issues/227)

### Bug fix

- Fixes the directory used for building the standalone installer.
  [#228](https://github.com/bentoml/openllm/issues/228)

## [0.2.25](https://github.com/bentoml/openllm/tree/v0.2.25)

### Features

- OpenLLM now includes a community-maintained ClojureScript UI. Thanks @GutZuFusss!

  See [this README.md](/external/clojure/README.md) for more information.

  OpenLLM also includes a `--cors` flag to start the server with CORS enabled.
  [#89](https://github.com/bentoml/openllm/issues/89)
- Nightly wheels can now be installed via `test.pypi.org`:

  ```bash
  pip install -i https://test.pypi.org/simple/ openllm
  ```

  [#215](https://github.com/bentoml/openllm/issues/215)
- Running vLLM with Falcon is now supported.
  [#223](https://github.com/bentoml/openllm/issues/223)

## [0.2.24](https://github.com/bentoml/openllm/tree/v0.2.24)

No significant changes.

## [0.2.23](https://github.com/bentoml/openllm/tree/v0.2.23)

### Features

- Added compiled wheels for all supported Python versions for Linux and macOS.
  [#201](https://github.com/bentoml/openllm/issues/201)

## [0.2.22](https://github.com/bentoml/openllm/tree/v0.2.22)

No significant changes.

## [0.2.21](https://github.com/bentoml/openllm/tree/v0.2.21)

### Changes

- Added lazy evaluation for compiled modules, which should speed up overall import time.
  [#200](https://github.com/bentoml/openllm/issues/200)

### Bug fix

- Fixes compiled wheels ignoring client libraries.
  [#197](https://github.com/bentoml/openllm/issues/197)

## [0.2.20](https://github.com/bentoml/openllm/tree/v0.2.20)

No significant changes.

## [0.2.19](https://github.com/bentoml/openllm/tree/v0.2.19)

No significant changes.

## [0.2.18](https://github.com/bentoml/openllm/tree/v0.2.18)

### Changes

- The runner server will now always spawn one instance regardless of the workers-per-resource configuration,
  i.e. if CUDA_VISIBLE_DEVICES=0,1,2 and `--workers-per-resource=0.5`, the runner will only use GPU indices `0,1`.
  [#189](https://github.com/bentoml/openllm/issues/189)

### Features

- OpenLLM can now also be installed via brew tap:

  ```bash
  brew tap bentoml/openllm https://github.com/bentoml/openllm

  brew install openllm
  ```

  [#190](https://github.com/bentoml/openllm/issues/190)

## [0.2.17](https://github.com/bentoml/openllm/tree/v0.2.17)

### Changes

- Updated the loading logic for PyTorch and vLLM to check for initialized parameters after placing the model on the correct devices.

  Added xformers to the base container for requirements of vLLM-based containers.
  [#185](https://github.com/bentoml/openllm/issues/185)

### Features

- Importing models will no longer load them into memory if the model ID is remote. Note that for GPTQ and local models the behaviour is unchanged.

  Also fixes the single-GPU case: we now ensure `to('cuda')` is called to place the model into GPU memory. Note that the GPU must have
  enough VRAM to hold the model.
  [#183](https://github.com/bentoml/openllm/issues/183)

## [0.2.16](https://github.com/bentoml/openllm/tree/v0.2.16)

No significant changes.

## [0.2.15](https://github.com/bentoml/openllm/tree/v0.2.15)

No significant changes.

## [0.2.14](https://github.com/bentoml/openllm/tree/v0.2.14)

### Bug fix

- Fixes a bug with `EnvVarMixin` where it didn't respect environment variables for specific fields.

  This inherently caused confusing behaviour with `--model-id`, which has now been addressed on main.

  The base docker image will now also include an installation of xformers built from source, locked at a given hash, since the latest releases of xformers
  are too old and would fail with vLLM when running within k8s.
  [#181](https://github.com/bentoml/openllm/issues/181)

## [0.2.13](https://github.com/bentoml/openllm/tree/v0.2.13)

No significant changes.

## [0.2.12](https://github.com/bentoml/openllm/tree/v0.2.12)

### Features

- Added support for a base container with OpenLLM. The base container contains all necessary requirements
  to run OpenLLM. Currently it includes compiled versions of FlashAttention v2, vLLM, AutoGPTQ and triton.

  This will now be the base image for all future BentoLLMs. The image will also be published to the public GHCR.

  To extend and use this image in your Bento, simply specify ``base_image`` under ``bentofile.yaml``:

  ```yaml
  docker:
    base_image: ghcr.io/bentoml/openllm:<hash>
  ```

  The release strategy includes:
  - versioning of ``ghcr.io/bentoml/openllm:sha-<sha1>`` for every commit to main, and ``ghcr.io/bentoml/openllm:0.2.11`` for a specific release version
  - the alias ``latest`` will be managed with docker/build-push-action (discouraged)

  Note that all these images include compiled kernels that have been tested on Ampere GPUs with CUDA 11.8.

  To quickly run the image, do the following:

  ```bash
  docker run --rm --gpus all -it -v /home/ubuntu/.local/share/bentoml:/tmp/bentoml -e BENTOML_HOME=/tmp/bentoml \
    -e OPENLLM_USE_LOCAL_LATEST=True -e OPENLLM_BACKEND=vllm ghcr.io/bentoml/openllm:2b5e96f90ad314f54e07b5b31e386e7d688d9bb2 \
    start llama --model-id meta-llama/Llama-2-7b-chat-hf --workers-per-resource conserved --debug
  ```

  In conjunction with this, OpenLLM now also has a set of small CLI utilities via ``openllm ext`` for ease of use.

  General fixes around codebase bytecode optimization.

  Fixes log output to filter the correct level based on ``--debug`` and ``--quiet``.

  ``openllm build`` will now run the model check locally by default. To skip it, pass in ``--fast`` (previously this was the default behaviour, but ``--no-fast`` as the default makes more sense here, as ``openllm build`` should also be able to run standalone).

  All ``LlaMA`` namespaces have been renamed to ``Llama`` (an internal change that shouldn't affect end users).

  ``openllm.AutoModel.for_model`` will now always return the instance. Runner kwargs will be handled via `create_runner`.
  [#142](https://github.com/bentoml/openllm/issues/142)
- All OpenLLM base containers are now scanned for security vulnerabilities using
  trivy (both SBOM mode and CVE).
  [#169](https://github.com/bentoml/openllm/issues/169)

## [0.2.11](https://github.com/bentoml/openllm/tree/v0.2.11)

### Features

- Added embeddings support for T5 and ChatGLM.
  [#153](https://github.com/bentoml/openllm/issues/153)

## [0.2.10](https://github.com/bentoml/openllm/tree/v0.2.10)

### Features

- Added support for installing via git archive:

  ```bash
  pip install "https://github.com/bentoml/openllm/archive/main.tar.gz"
  ```

  [#143](https://github.com/bentoml/openllm/issues/143)
- Users can now call ``client.embed`` to get embeddings from the running LLMServer:

  ```python
  import openllm

  client = openllm.client.HTTPClient("http://localhost:3000")

  client.embed("Hello World")
  client.embed(["Hello", "World"])
  ```

  > **Note:** ``client.embed`` is currently only implemented for ``openllm.client.HTTPClient`` and ``openllm.client.AsyncHTTPClient``.

  Users can also query embeddings directly from the CLI via ``openllm embed``:

  ```bash
  $ openllm embed --endpoint localhost:3000 "Hello World" "My name is Susan"

  [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
  ```

  [#146](https://github.com/bentoml/openllm/issues/146)

### Bug fix

- Fixes model location resolution when running within the BentoContainer.

  This makes sure that the tags and model path are inferred correctly, based on BENTO_PATH and /.dockerenv.
  [#141](https://github.com/bentoml/openllm/issues/141)

## [0.2.9](https://github.com/bentoml/openllm/tree/v0.2.9)

No significant changes.

## [0.2.8](https://github.com/bentoml/openllm/tree/v0.2.8)

### Features

- APIs for LLMService are now provisional based on the capabilities of the LLM.

  The following APIs are considered provisional:

  - `/v1/embeddings`: available if the LLM supports embeddings (i.e. ``LLM.embeddings`` is implemented; example model: ``llama``)
  - `/hf/agent`: available if the LLM supports running HF agents (i.e. ``LLM.generate_one`` is implemented; example models: ``starcoder``, ``falcon``)
  - `POST /v1/adapters` and `GET /v1/adapters`: available if the server is running with LoRA weights

  ``openllm.LLMRunner`` now includes three additional booleans:
  - `runner.supports_embeddings`: whether this runner supports embeddings
  - `runner.supports_hf_agent`: whether this runner supports HF agents
  - `runner.has_adapters`: whether this runner is loaded with LoRA adapters

  Optimized ``openllm.models``'s bytecode performance.
  [#133](https://github.com/bentoml/openllm/issues/133)

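A quick probe sketch for one of the provisional endpoints; only the path comes from this entry, the port is assumed:

```bash
# Lists loaded adapters; only available when the server is running with LoRA weights.
curl -s http://localhost:3000/v1/adapters
```
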
## [0.2.7](https://github.com/bentoml/openllm/tree/v0.2.7)

No significant changes.

## [0.2.6](https://github.com/bentoml/openllm/tree/v0.2.6)

### Backwards-incompatible Changes

- Updated the signatures of `load_model` and `load_tokenizer` to no longer accept a tag.
  The tag can be accessed via `llm.tag`, or, if using `openllm.serialisation` or `bentoml.transformers`, via `self._bentomodel`.

  Updated shared serialisation logic to reduce the call stack, saving three call traces.
  [#132](https://github.com/bentoml/openllm/issues/132)

## [0.2.5](https://github.com/bentoml/openllm/tree/v0.2.5)

### Features

- Added support for sending generation arguments via the CLI:

  ```bash
  openllm query --endpoint localhost:3000 "What is the difference between noun and pronoun?" --sampling-params temperature 0.84
  ```

  Fixed the llama2 QLoRA training script to save unquantized weights.
  [#130](https://github.com/bentoml/openllm/issues/130)

## [0.2.4](https://github.com/bentoml/openllm/tree/v0.2.4)

No significant changes.

## [0.2.3](https://github.com/bentoml/openllm/tree/v0.2.3)

No significant changes.

## [0.2.2](https://github.com/bentoml/openllm/tree/v0.2.2)

No significant changes.

## [0.2.1](https://github.com/bentoml/openllm/tree/v0.2.1)

No significant changes.

## [0.2.0](https://github.com/bentoml/openllm/tree/v0.2.0)

### Features

- Added support for GPTNeoX models. All variants of GPTNeoX, including Dolly-V2
  and StableLM, can now also use `openllm start gpt-neox`.

  `openllm models -o json` now returns CPU and GPU fields. `openllm models` now
  shows a table that mimics the one from README.md.

  Added scripts to automatically add model imports to `__init__.py`.

  `--workers-per-resource` now accepts the following strategies (see the sketch after this list):

  - `round_robin`: similar behaviour to setting `--workers-per-resource 1`. This
    is useful for smaller models.
  - `conserved`: determines the number of available GPU resources and
    assigns only one worker for the LLMRunner with all available GPU resources. For
    example, if there are 4 GPUs available, then `conserved` is equivalent to
    `--workers-per-resource 0.25`.

  [#106](https://github.com/bentoml/openllm/issues/106)
- Added support for [Baichuan](https://github.com/baichuan-inc/Baichuan-7B) model
  generation, contributed by @hetaoBackend.

  Fixes how we handle the model loader auto class for trust_remote_code in
  transformers.
  [#115](https://github.com/bentoml/openllm/issues/115)

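The sketch referenced in the `--workers-per-resource` item above; the model name comes from this entry and the rest is illustrative:

```bash
# One worker is assigned all available GPUs under the `conserved` strategy.
openllm start gpt-neox --workers-per-resource conserved
```
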
### Bug fix

- Fixes relative model_id handling when running the LLM within the container.

  Added support for building the container directly with `openllm build`. Users can now
  do `openllm build --format=container`:

  ```bash
  openllm build flan-t5 --format=container
  ```

  This is equivalent to:

  ```bash
  openllm build flan-t5 && bentoml containerize google-flan-t5-large-service
  ```

  Added snapshot testing and more robust edge cases for model testing.

  General improvements in `openllm.LLM.import_model`, where it will now parse sanitised
  parameters automatically.

  Fixes `openllm start <bento>` to use the correct `model_id`, ignoring `--model-id`
  (the correct behaviour).

  Fixes `--workers-per-resource conserved` to respect `--device`.

  Added an initial interface for `LLM.embeddings`.
  [#107](https://github.com/bentoml/openllm/issues/107)
- Fixes resources to correctly follow the CUDA_VISIBLE_DEVICES spec.

  OpenLLM now contains a standalone parser that mimics the `torch.cuda` parser for setting
  GPU devices. This parser will be used to parse both AMD and NVIDIA GPUs.

  `openllm` should now be able to parse `GPU-` and `MIG-` UUIDs from either
  configuration or spec.
  [#114](https://github.com/bentoml/openllm/issues/114)

## [0.1.20](https://github.com/bentoml/openllm/tree/v0.1.20)

### Features

- ### Fine-tuning support for Falcon

  Added support for fine-tuning Falcon models with QLoRA.

  OpenLLM now brings an `openllm playground`, which creates a Jupyter notebook for
  easy fine-tuning scripts.

  Currently, it supports fine-tuning OPT and Falcon, with more to come.

  `openllm.LLM` now provides a `prepare_for_training` helper to easily set up LoRA
  and related configuration for fine-tuning.
  [#98](https://github.com/bentoml/openllm/issues/98)

### Bug fix

- Fixes loading the MPT config on CPU.

  Fixes runner StopIteration on GET for the Starlette app.
  [#92](https://github.com/bentoml/openllm/issues/92)
- `openllm.LLM` now generates tags based on the given `model_id` and optional
  `model_version`.

  If the given `model_id` is a custom path, the name will be the basename of the
  directory, and the version will be the hash of the last modified time.

  `openllm start` now provides a `--runtime` flag, allowing setup of different runtimes.
  Currently it defaults to `transformers`. GGML support is a work in progress.

  Fixes miscellaneous items when saving models with quantized weights.
  [#102](https://github.com/bentoml/openllm/issues/102)

## [0.1.19](https://github.com/bentoml/openllm/tree/v0.1.19)

No significant changes.

## [0.1.18](https://github.com/bentoml/openllm/tree/v0.1.18)

### Features

- `openllm.LLMConfig` now supports the `dict()` protocol:

  ```python
  import openllm

  config = openllm.LLMConfig.for_model("opt")

  print(config.items())
  print(config.values())
  print(config.keys())
  print(dict(config))
  ```

  [#85](https://github.com/bentoml/openllm/issues/85)
- Added support for MPT to OpenLLM.

  Fixes LLMConfig to only parse the environment when it is available.
  [#91](https://github.com/bentoml/openllm/issues/91)

## [0.1.17](https://github.com/bentoml/openllm/tree/v0.1.17)

### Bug fix

- Fixes loading logic from custom paths. If a model path is given, OpenLLM
  won't try to import it into the local store.

  OpenLLM now only imports and fixes the models to load correctly within the
  bento; see the generated service for more information.

  Fixes the service not being ready when serving within a container or on BentoCloud. This
  has to do with how we previously loaded the model into the bento.

  Falcon loading logic has been reimplemented to fix this major bug. Make sure to
  delete all previously saved weights for falcon with `openllm prune`.

  `openllm start` now supports bentos:

  ```bash
  openllm start llm-bento --help
  ```

  [#80](https://github.com/bentoml/openllm/issues/80)

## [0.1.16](https://github.com/bentoml/openllm/tree/v0.1.16)

No significant changes.

## [0.1.15](https://github.com/bentoml/openllm/tree/v0.1.15)

### Features

- `openllm.Runner` now supports AMD GPUs, addressing #65.

  It also respects CUDA_VISIBLE_DEVICES when set correctly, allowing the GPU to be disabled so the
  runner runs on CPU only.
  [#72](https://github.com/bentoml/openllm/issues/72)

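A CPU-only sketch; hiding all GPUs via an empty CUDA_VISIBLE_DEVICES follows the usual CUDA convention but is an assumption here, and the model name is illustrative:

```bash
# Hedged sketch: empty CUDA_VISIBLE_DEVICES hides all GPUs so the runner falls back to CPU.
CUDA_VISIBLE_DEVICES="" openllm start opt
```
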
## [0.1.14](https://github.com/bentoml/openllm/tree/v0.1.14)

### Features

- Added support for standalone binary distribution. This currently works on Linux and
  Windows.

  The following targets are supported:

  - aarch64-unknown-linux-gnu
  - x86_64-unknown-linux-gnu
  - x86_64-unknown-linux-musl
  - i686-unknown-linux-gnu
  - powerpc64le-unknown-linux-gnu
  - x86_64-pc-windows-msvc
  - i686-pc-windows-msvc

  Reverted matrix expansion for CI to all Python versions. Now leveraging Hatch
  env matrices.
  [#66](https://github.com/bentoml/openllm/issues/66)

### Bug fix

- Moved the implementation of dolly-v2 and falcon serialization to save a PreTrainedModel instead of a pipeline.

  Saving dolly-v2 now saves the actual model instead of the pipeline abstraction. If you have a Dolly-V2
  model available locally, we kindly ask you to run `openllm prune` to get the new implementation.

  Dolly-v2 and falcon now implement some memory optimizations to help with loading on lower-resource systems.

  Configuration removed field: 'use_pipeline'.
  [#60](https://github.com/bentoml/openllm/issues/60)
- Removed a duplicated class instance of `generation_config`, as it should be set via
  instance attributes.

  Fixes test flakiness and one broken case for parsing env.
  [#64](https://github.com/bentoml/openllm/issues/64)

## [0.1.13](https://github.com/bentoml/openllm/tree/v0.1.13)

No significant changes.

## [0.1.12](https://github.com/bentoml/openllm/tree/v0.1.12)

### Features

- Serving LLMs with fine-tuned LoRA/QLoRA adapter layers.

  The given fine-tuned weights can be served with the model via
  `openllm start`:

  ```bash
  openllm start opt --model-id facebook/opt-6.7b --adapter-id /path/to/adapters
  ```

  If you just wish to try some pretrained adapter checkpoints, you can use
  `--adapter-id`:

  ```bash
  openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora
  ```

  To use multiple adapters, use the following format:

  ```bash
  openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora --adapter-id aarnphm/opt-6.7b-lora:french_lora
  ```

  By default, the first `adapter-id` will be the default LoRA layer, but
  optionally users can change which LoRA layer to use for inference via
  `/v1/adapters`:

  ```bash
  curl -X POST http://localhost:3000/v1/adapters --json '{"adapter_name": "vn_lora"}'
  ```

  > Note that for multiple `adapter-name` and `adapter-id`, it is recommended to
  > update to use the default adapter before sending the inference, to avoid any
  > performance degradation.

  To include this in the Bento, one can also provide `--adapter-id` to
  `openllm build`:

  ```bash
  openllm build opt --model-id facebook/opt-6.7b --adapter-id ...
  ```

  Separated out the configuration builder to make it more flexible for future
  configuration generation.
  [#52](https://github.com/bentoml/openllm/issues/52)

### Bug fix

- Fixes how `llm.ensure_model_id_exists` parses `openllm download` correctly.

  Renamed `openllm.utils.ModelEnv` to `openllm.utils.EnvVarMixin`.
  [#58](https://github.com/bentoml/openllm/issues/58)

## [0.1.11](https://github.com/bentoml/openllm/tree/v0.1.11)

No significant changes.

## [0.1.10](https://github.com/bentoml/openllm/tree/v0.1.10)

No significant changes.

## [0.1.9](https://github.com/bentoml/openllm/tree/v0.1.9)

### Changes

- Fixes setting logs for agents to info instead of logger object.
  [#37](https://github.com/bentoml/openllm/issues/37)

## [0.1.8](https://github.com/bentoml/openllm/tree/v0.1.8)

No significant changes.

## [0.1.7](https://github.com/bentoml/openllm/tree/v0.1.7)

### Features

- OpenLLM now seamlessly integrates with HuggingFace Agents.
  Replace the HfAgent endpoint with a running remote server.

  ```python
  import transformers

  agent = transformers.HfAgent("http://localhost:3000/hf/agent")  # URL that runs the OpenLLM server

  agent.run("Is the following `text` positive or negative?", text="I don't like how this models is generate inputs")
  ```

  Note that only `starcoder` is currently supported for the agent feature.

  To use it from `openllm.client`, do:

  ```python
  import openllm

  client = openllm.client.HTTPClient("http://123.23.21.1:3000")

  client.ask_agent(
    task="Is the following `text` positive or negative?",
    text="What are you thinking about?",
    agent_type="hf",
  )
  ```

  Fixes an asyncio exception by increasing the timeout.
  [#29](https://github.com/bentoml/openllm/issues/29)

## [0.1.6](https://github.com/bentoml/openllm/tree/v0.1.6)

### Changes

- `--quantize` now takes `int8, int4` instead of `8bit, 4bit`, to be consistent
  with bitsandbytes concepts.

  The `openllm` CLI now caches all available model commands, allowing faster startup times.

  Fixes `openllm start model-id --debug` to filter out debug messages from
  `bentoml.Server`.

  `--model-id` from `openllm start` now supports choices for easier selection.

  Updated the `ModelConfig` implementation with `__getitem__` and auto-generated values.

  Cleaned up the CLI and improved loading time; `openllm start` should be 'blazingly
  fast'.
  [#28](https://github.com/bentoml/openllm/issues/28)

### Features

- Added support for quantization during serving time.

  `openllm start` now supports `--quantize int8` and `--quantize int4`. `GPTQ`
  quantization support is on the roadmap and currently being worked on.

  Refactored `openllm.LLMConfig` to be usable with `__getitem__`:
  `openllm.DollyV2Config()['requirements']`.

  The access order is:
  `__openllm_*__ > self.<key> > __openllm_generation_class__ > __openllm_extras__`.

  Added a `towncrier` workflow to easily generate changelog entries.

  `LLMConfig` now supports the `__dataclass_transform__` protocol to help with
  type-checking.

  `openllm download-models` now becomes `openllm download`.
  [#27](https://github.com/bentoml/openllm/issues/27)

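A serving-time quantization sketch; the `--quantize` values come from this entry and the model name is illustrative:

```bash
# Hedged sketch: model name is illustrative.
openllm start dolly-v2 --quantize int8
```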