Changelog
We follow semantic versioning with a strict backward-compatibility policy.
You can find our backward-compatibility policy here.
Changes for the upcoming release can be found in the 'changelog.d' directory in our repository.
0.2.26
Features
- Added a generic embedding implementation, largely based on https://github.com/bentoml/sentence-embedding-bento, for all unsupported models. #227
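
A rough sketch of the mean-pooling approach used by generic sentence-embedding services (illustrative only; the model id and pooling details are assumptions, not OpenLLM's exact implementation):

```python
# Illustrative sketch: mean-pooled embeddings for an arbitrary HF model.
# Model id and pooling details are assumptions, not OpenLLM's exact code.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # example model only
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(sentences: list[str]) -> torch.Tensor:
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq, 1)
    # Mean-pool over non-padding tokens, then L2-normalise.
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

print(embed(["Hello World", "My name is Susan"]).shape)
```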
Bug fix
- Fixes the directory used for building the standalone installer #228
0.2.25
Features
- OpenLLM now includes a community-maintained ClojureScript UI, thanks @GutZuFusss!
  See this README.md for more information.
  OpenLLM also includes a `--cors` flag to start the server with CORS enabled. #89
- Nightly wheels can now be installed via test.pypi.org: `pip install -i https://test.pypi.org/simple/ openllm`
- Running vLLM with Falcon is now supported #223
0.2.24
No significant changes.
0.2.23
Features
- Added compiled wheels for all supported Python versions on Linux and macOS #201
0.2.22
No significant changes.
0.2.21
Changes
- Added lazy eval for compiled modules, which should speed up overall import time #200
Bug fix
- Fixes compiled wheels ignoring client libraries #197
0.2.20
No significant changes.
0.2.19
No significant changes.
0.2.18
Changes
- The runner server will now always spawn one instance regardless of the `workers-per-resource` configuration,
  i.e. if `CUDA_VISIBLE_DEVICES=0,1,2` and `--workers-per-resource=0.5`, then the runner will only use the GPUs at index `0,1` #189
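
For illustration, a minimal sketch of the arithmetic above (not OpenLLM's actual scheduler code):

```python
# Illustrative only: which devices a single runner worker ends up with
# when --workers-per-resource is fractional (0.5 => 2 GPUs per worker).
import os

def devices_for_single_worker(workers_per_resource: float) -> list[str]:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")
    devices_per_worker = int(1 / workers_per_resource)
    return visible[:devices_per_worker]

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
print(devices_for_single_worker(0.5))  # ['0', '1'] -> only index 0,1 are used
```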
Features
- OpenLLM can now also be installed via a Homebrew tap: #190

  ```bash
  brew tap bentoml/openllm https://github.com/bentoml/openllm
  brew install openllm
  ```
0.2.17
Changes
- Updated the loading logic for PyTorch and vLLM to check for initialized parameters after placing the model on the correct devices.
  Added xformers to the base container to satisfy requirements for vLLM-based containers #185
Features
- Importing models will no longer load them into memory when the model is a remote ID. Note that for GPTQ and local models the behaviour is unchanged.
  When there is exactly one GPU, OpenLLM now ensures `to('cuda')` is called to place the model on that device. Note that the GPU must have enough VRAM to hold the model. #183
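
A minimal sketch of the single-GPU placement behaviour described above (illustrative, not the library's actual loading code; the model id is only an example and the GPU must have enough VRAM):

```python
# Illustrative sketch: place the model on the GPU when exactly one is visible.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # example model
if torch.cuda.is_available() and torch.cuda.device_count() == 1:
    # Requires enough VRAM to hold the full model on that single GPU.
    model = model.to("cuda")
print(next(model.parameters()).device)
```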
0.2.16
No significant changes.
0.2.15
No significant changes.
0.2.14
Bug fix
- Fixes a bug with `EnvVarMixin` where it didn't respect environment variables for specific fields. This led to confusing behaviour with `--model-id`; it has now been addressed on main.
  The base Docker image will now also include an installation of xformers built from source, pinned to a given hash, since the latest xformers release is too old and would fail with vLLM when running within Kubernetes #181
0.2.13
No significant changes.
0.2.12
Features
- Added support for a base container for OpenLLM. The base container contains all necessary requirements to run OpenLLM. Currently it includes compiled versions of FlashAttention v2, vLLM, AutoGPTQ and triton.
  This will now be the base image for all future BentoLLM images. The image will also be published to the public GHCR.

  To extend and use this image in your bento, simply specify `base_image` under `bentofile.yaml`:

  ```yaml
  docker:
    base_image: ghcr.io/bentoml/openllm:<hash>
  ```

  The release strategy includes:
  - versioning of `ghcr.io/bentoml/openllm:sha-<sha1>` for every commit to main, and `ghcr.io/bentoml/openllm:0.2.11` for a specific release version
  - the alias `latest` will be managed with docker/build-push-action (discouraged)

  Note that all these images include compiled kernels that have been tested on Ampere GPUs with CUDA 11.8.

  To quickly run the image, do the following:

  ```bash
  docker run --rm --gpus all -it -v /home/ubuntu/.local/share/bentoml:/tmp/bentoml -e BENTOML_HOME=/tmp/bentoml \
    -e OPENLLM_USE_LOCAL_LATEST=True -e OPENLLM_LLAMA_FRAMEWORK=vllm \
    ghcr.io/bentoml/openllm:2b5e96f90ad314f54e07b5b31e386e7d688d9bb2 start llama \
    --model-id meta-llama/Llama-2-7b-chat-hf --workers-per-resource conserved --debug
  ```

  In conjunction with this, OpenLLM now also has a set of small CLI utilities via `openllm ext` for ease of use.
  General fixes around codebase bytecode optimization.
  Fixes log output to filter the correct level based on `--debug` and `--quiet`.
  `openllm build` will now run the model check locally by default. To skip it, pass `--fast` (previously this was the default behaviour, but `--no-fast` as the default makes more sense here, since `openllm build` should also be able to run standalone).
  All `LlaMA` namespaces have been renamed to `Llama` (internal change, shouldn't affect end users).
  `openllm.AutoModel.for_model` will now always return the instance. Runner kwargs will be handled via `create_runner` #142
- All OpenLLM base containers are now scanned for security vulnerabilities using trivy (both SBOM mode and CVE) #169
0.2.11
Features
- Added embeddings support for T5 and ChatGLM #153
0.2.10
Features
- Added support for installing via git archive:

  ```bash
  pip install "https://github.com/bentoml/openllm/archive/main.tar.gz"
  ```

- Users can now call `client.embed` to get embeddings from the running LLMServer:

  ```python
  client = openllm.client.HTTPClient("http://localhost:3000")
  client.embed("Hello World")
  client.embed(["Hello", "World"])
  ```

  Note: `client.embed` is currently only implemented for `openllm.client.HTTPClient` and `openllm.client.AsyncHTTPClient`.

  Users can also query embeddings directly from the CLI, via `openllm embed`:

  ```bash
  $ openllm embed --endpoint localhost:3000 "Hello World" "My name is Susan"
  [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
  ```
Bug fix
- Fixes model location resolution when running within the Bento container.
  This makes sure that the tags and model path are inferred correctly, based on `BENTO_PATH` and `/.dockerenv` #141
0.2.9
No significant changes.
0.2.8
Features
- APIs for LLMService are now provisional, based on the capabilities of the LLM.
  The following APIs are considered provisional:
  - `/v1/embeddings`: available if the LLM supports embeddings (i.e. `LLM.embeddings` is implemented; example model: `llama`)
  - `/hf/agent`: available if the LLM supports running HF agents (i.e. `LLM.generate_one` is implemented; example models: `starcoder`, `falcon`)
  - `POST /v1/adapters` and `GET /v1/adapters`: available if the server is running with LoRA weights

  `openllm.LLMRunner` now includes three additional booleans:
  - `runner.supports_embeddings`: whether this runner supports embeddings
  - `runner.supports_hf_agent`: whether this runner supports HF agents
  - `runner.has_adapters`: whether this runner is loaded with LoRA adapters

  Optimized `openllm.models`'s bytecode performance #133
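
A hedged sketch of how these capability flags might be consulted before calling the corresponding endpoints; the `openllm.Runner("llama")` constructor usage is assumed from other entries in this changelog and may differ in detail:

```python
# Sketch only: gate optional features on the new runner booleans.
import openllm

runner = openllm.Runner("llama")  # constructor usage assumed; arguments may differ

if runner.supports_embeddings:
    print("/v1/embeddings will be available for this model")
if runner.supports_hf_agent:
    print("/hf/agent is available (e.g. starcoder, falcon)")
if runner.has_adapters:
    print("GET/POST /v1/adapters can be used to manage LoRA layers")
```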
0.2.7
No significant changes.
0.2.6
Backwards-incompatible Changes
- Updated the signatures of `load_model` and `load_tokenizer` so they no longer accept a tag. The tag can be accessed via `llm.tag`, or, when using `openllm.serialisation` or `bentoml.transformers`, via `self._bentomodel`.
  Updated shared serialisation logic to reduce the call stack when saving by three calls. #132
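
A possible migration sketch for this change; the `for_model` and `load_model()`/`load_tokenizer()` usages below are assumptions pieced together from other entries in this changelog, not a confirmed API:

```python
# Migration sketch (assumed API): stop passing a tag, read it from the LLM instead.
import openllm

llm = openllm.AutoModel.for_model("opt")  # returns the LLM instance (see 0.2.12 notes)

# Before: load_model(tag, ...) / load_tokenizer(tag, ...)
# After: the tag is derived from the instance itself.
print(llm.tag)
model = llm.load_model()          # no tag argument anymore
tokenizer = llm.load_tokenizer()  # ditto
```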
0.2.5
Features
- Added support for sending sampling arguments via the CLI:

  ```bash
  openllm query --endpoint localhost:3000 "What is the difference between noun and pronoun?" --sampling-params temperature 0.84
  ```

  Fixed the Llama 2 QLoRA training script to save unquantized weights #130
0.2.4
No significant changes.
0.2.3
No significant changes.
0.2.2
No significant changes.
0.2.1
No significant changes.
0.2.0
Features
- Added support for GPTNeoX models. All variants of GPTNeoX, including Dolly-V2 and StableLM, can now also use `openllm start gpt-neox`.
  `openllm models -o json` now returns CPU and GPU fields. `openllm models` now shows a table that mimics the one from README.md.
  Added scripts to automatically add model imports to `__init__.py`.
  `--workers-per-resource` now accepts the following strategies:
  - `round_robin`: similar behaviour to setting `--workers-per-resource 1`. This is useful for smaller models.
  - `conserved`: determines the number of available GPU resources and assigns only one worker to the LLMRunner with all available GPU resources. For example, if there are 4 GPUs available, then `conserved` is equivalent to `--workers-per-resource 0.25`. #106
- Added support for Baichuan model generation, contributed by @hetaoBackend.
  Fixes how we handle the model loader auto class for `trust_remote_code` in transformers #115
Bug fix
- Fixes relative `model_id` handling when running an LLM within the container.
  Added support for building a container directly with `openllm build`. Users can now run `openllm build --format=container`:

  ```bash
  openllm build flan-t5 --format=container
  ```

  This is equivalent to:

  ```bash
  openllm build flan-t5 && bentoml containerize google-flan-t5-large-service
  ```

  Added snapshot testing and more robust edge cases for model testing.
  General improvement in `openllm.LLM.import_model`, which will now parse sanitised parameters automatically.
  Fixes `openllm start <bento>` to use the correct `model_id`, ignoring `--model-id` (the correct behaviour).
  Fixes `--workers-per-resource conserved` to respect `--device`.
  Added an initial interface for `LLM.embeddings` #107
Fixes resources to correctly follows CUDA_VISIBLE_DEVICES spec
OpenLLM now contains a standalone parser that mimic
torch.cudaparser for set GPU devices. This parser will be used to parse both AMD and NVIDIA GPUs.openllmshould now be able to parseGPU-andMIG-UUID from both configuration or spec. #114
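
As an illustration of the kind of parsing described above (a standalone sketch, not OpenLLM's actual parser), a `CUDA_VISIBLE_DEVICES`-style spec can mix plain indices with `GPU-`/`MIG-` UUIDs:

```python
# Sketch: accept plain indices and GPU-/MIG- UUIDs in a CUDA_VISIBLE_DEVICES-like spec.
def parse_visible_devices(spec: str) -> list[str]:
    devices: list[str] = []
    for item in (s.strip() for s in spec.split(",") if s.strip()):
        if item.startswith(("GPU-", "MIG-")) or item.isdigit():
            devices.append(item)
        else:
            # CUDA stops enumerating at the first invalid entry; mirror that here.
            break
    return devices

print(parse_visible_devices("0,1,GPU-5ebe9f43-8f2c-4a6a-8d9f-000000000000"))
```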
0.1.20
Features
- Fine-tuning support for Falcon:
  Added support for fine-tuning Falcon models with QLoRA.
  OpenLLM now ships an `openllm playground` command, which creates a Jupyter notebook with easy fine-tuning scripts. Currently it supports fine-tuning OPT and Falcon, with more to come.
  `openllm.LLM` now provides a `prepare_for_training` helper to easily set up LoRA and related configuration for fine-tuning #98
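
A hypothetical sketch of how the `prepare_for_training` helper might be used; the keyword arguments shown (`adapter_type`, `r`, `lora_alpha`) and the return value are PEFT-style assumptions, not the confirmed signature:

```python
# Hypothetical sketch: set up a model for LoRA fine-tuning via prepare_for_training.
import openllm

llm = openllm.AutoModel.for_model("falcon")

# Parameter names below are assumptions (PEFT-style), not the confirmed signature.
model, tokenizer = llm.prepare_for_training(
    adapter_type="lora",
    r=8,
    lora_alpha=16,
)
# `model` would now carry trainable LoRA layers and can be handed to a Trainer.
```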
Bug fix
- Fixes loading MPT config on CPU.
  Fixes runner StopIteration on GET for the Starlette app #92
- `openllm.LLM` now generates tags based on the given `model_id` and an optional `model_version`. If the given `model_id` is a custom path, the name will be the basename of the directory and the version will be the hash of the last modified time.
  `openllm start` now provides a `--runtime` option, allowing a different runtime to be set up. Currently it refactors to `transformers`; GGML support is a work in progress.
  Fixes miscellaneous items when saving models with quantized weights. #102
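
An illustrative reconstruction of the tag scheme described above (not the exact implementation): for a custom path, the name comes from the directory basename and the version from a hash of its last modified time.

```python
# Illustrative sketch of the described tag scheme for a local model path.
import hashlib
import os

def tag_for_local_path(model_id: str) -> str:
    name = os.path.basename(os.path.abspath(model_id))
    mtime = str(os.path.getmtime(model_id))
    version = hashlib.sha1(mtime.encode()).hexdigest()[:8]
    return f"{name}:{version}"

print(tag_for_local_path("."))  # e.g. "my-model-dir:1f3a9c2b"
```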
0.1.19
No significant changes.
0.1.18
Features
- `openllm.LLMConfig` now supports the `dict()` protocol:

  ```python
  config = openllm.LLMConfig.for_model("opt")
  print(config.items())
  print(config.values())
  print(config.keys())
  print(dict(config))
  ```

- Added support for MPT to OpenLLM.
  Fixes `LLMConfig` to only parse the environment when it is available #91
0.1.17
Bug fix
- Fixes loading logic from a custom path. If a model path is given, OpenLLM won't try to import it into the local store.
  OpenLLM now only imports and fixes the models so they load correctly within the bento; see the generated service for more information.
  Fixes the service not being ready when serving within a container or on BentoCloud. This has to do with how the model was previously loaded into the bento.
  The Falcon loading logic has been reimplemented to fix this major bug. Make sure to delete all previously saved weights for Falcon with `openllm prune`.
  `openllm start` now supports bentos: `openllm start llm-bento --help`
0.1.16
No significant changes.
0.1.15
Features
- `openllm.Runner` now supports AMD GPUs, addressing #65. It also respects `CUDA_VISIBLE_DEVICES` when set, allowing the GPU to be disabled and running on CPU only. #72
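
A small sketch of disabling the GPU through `CUDA_VISIBLE_DEVICES` before creating the runner; the `openllm.Runner("opt")` and `init_local()` calls are assumptions and may differ from the exact API:

```python
# Sketch: force CPU-only execution by hiding all GPUs before the runner is created.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # no GPUs visible -> CPU only

import openllm

runner = openllm.Runner("opt")  # assumed constructor usage
runner.init_local()             # assumed local initialisation; actual setup may differ
```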
0.1.14
Features
- Added support for standalone binary distribution. It currently works on Linux and Windows.
  The following targets are supported:
  - aarch64-unknown-linux-gnu
  - x86_64-unknown-linux-gnu
  - x86_64-unknown-linux-musl
  - i686-unknown-linux-gnu
  - powerpc64le-unknown-linux-gnu
  - x86_64-pc-windows-msvc
  - i686-pc-windows-msvc

  Reverted matrix expansion for CI to all Python versions. Now leveraging Hatch env matrices #66
Bug fix
- Moved the implementation of Dolly-V2 and Falcon serialization to save a `PreTrainedModel` instead of a pipeline.
  Saving Dolly-V2 now saves the actual model instead of the pipeline abstraction. If you have a Dolly-V2 model available locally, we kindly ask you to run `openllm prune` to make the new implementation available.
  Dolly-V2 and Falcon now implement some memory optimizations to help with loading on lower-resource systems.
  Removed configuration field: `use_pipeline` #60
- Removed the duplicated class instance of `generation_config`, as it should be set via instance attributes.
  Fixes test flakiness and one broken case for parsing env #64
0.1.13
No significant changes.
0.1.12
Features
- Serving LLMs with fine-tuned LoRA and QLoRA adapter layers.
  The fine-tuned weights can be served with the model via `openllm start`:

  ```bash
  openllm start opt --model-id facebook/opt-6.7b --adapter-id /path/to/adapters
  ```

  If you just wish to try some pretrained adapter checkpoint, you can use `--adapter-id`:

  ```bash
  openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora
  ```

  To use multiple adapters, use the following format:

  ```bash
  openllm start opt --model-id facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora --adapter-id aarnphm/opt-6.7b-lora:french_lora
  ```

  By default, the first `adapter-id` will be the default LoRA layer, but users can optionally change which LoRA layer to use for inference via `/v1/adapters`:

  ```bash
  curl -X POST http://localhost:3000/v1/adapters --json '{"adapter_name": "vn_lora"}'
  ```

  Note that with multiple `adapter-name` and `adapter-id` values, it is recommended to switch back to the default adapter before sending inference requests, to avoid any performance degradation.

  To include this in the Bento, one can also provide a `--adapter-id` to `openllm build`:

  ```bash
  openllm build opt --model-id facebook/opt-6.7b --adapter-id ...
  ```

  Separated out the configuration builder, to make it more flexible for future configuration generation. #52
Bug fix
- Fixes how `llm.ensure_model_id_exists` parses `openllm download` correctly.
  Renamed `openllm.utils.ModelEnv` to `openllm.utils.EnvVarMixin` #58
0.1.11
No significant changes.
0.1.10
No significant changes.
0.1.9
Changes
- Fixes setting logs for agents to info instead of logger object. #37
0.1.8
No significant changes.
0.1.7
Features
- OpenLLM now seamlessly integrates with HuggingFace Agents. Replace the HfAgent endpoint with a running remote server:

  ```python
  import transformers

  agent = transformers.HfAgent("http://localhost:3000/hf/agent")  # URL of the running OpenLLM server
  agent.run("Is the following `text` positive or negative?", text="I don't like how this models is generate inputs")
  ```

  Note that only `starcoder` is currently supported for the agent feature.

  To use it from `openllm.client`, do:

  ```python
  import openllm

  client = openllm.client.HTTPClient("http://123.23.21.1:3000")
  client.ask_agent(
      task="Is the following `text` positive or negative?",
      text="What are you thinking about?",
      agent_type="hf",
  )
  ```

  Fixes an asyncio exception by increasing the timeout #29
0.1.6
Changes
- `--quantize` now takes `int8, int4` instead of `8bit, 4bit`, to be consistent with bitsandbytes concepts.
  The `openllm` CLI now caches all available model commands, allowing faster startup time.
  Fixes `openllm start model-id --debug` to filter out debug message logs from `bentoml.Server`.
  `--model-id` from `openllm start` now supports choices for easier selection.
  Updated the `ModelConfig` implementation with `__getitem__` and auto-generated values.
  Cleaned up the CLI and improved loading time; `openllm start` should be 'blazingly fast'. #28
Features
- Added support for quantization at serving time.
  `openllm start` now supports `--quantize int8` and `--quantize int4`. `GPTQ` quantization support is on the roadmap and currently being worked on.
  `openllm start` now also supports `--bettertransformer` to use `BetterTransformer` for serving.
  Refactored `openllm.LLMConfig` to be usable with `__getitem__`: `openllm.DollyV2Config()['requirements']`. The access order is `__openllm_*__ > self.<key> > __openllm_generation_class__ > __openllm_extras__`.
  Added a `towncrier` workflow to easily generate changelog entries.
  Added `use_pipeline` and `bettertransformer` flags to `ModelSettings`.
  `LLMConfig` now supports the `__dataclass_transform__` protocol to help with type-checking.
  `openllm download-models` now becomes `openllm download` #27