mirror of
https://github.com/mudler/LocalAI.git
synced 2026-02-03 11:13:31 -05:00
Compare commits
6 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
b7c0a108f5 | ||
|
|
f694a89c28 | ||
|
|
be682e6c2f | ||
|
|
bf85a31f9e | ||
|
|
d69048e0b0 | ||
|
|
827f189163 |
@@ -24,7 +24,6 @@ image:
|
||||
FROM +go-deps
|
||||
ARG IMAGE=alpaca-cli-nomodel
|
||||
COPY +build/llama-cli /llama-cli
|
||||
ENV MODEL_PATH=/model.bin
|
||||
ENTRYPOINT [ "/llama-cli" ]
|
||||
SAVE IMAGE --push $IMAGE
|
||||
|
||||
|
||||
40
README.md
40
README.md
@@ -1,14 +1,16 @@
|
||||
## :camel: llama-cli
|
||||
|
||||
|
||||
llama-cli is a straightforward golang CLI interface for [llama.cpp](https://github.com/ggerganov/llama.cpp), providing a simple API and a command line interface that allows text generation using a GPT-based model like llama directly from the terminal.
|
||||
llama-cli is a straightforward golang CLI interface for [llama.cpp](https://github.com/ggerganov/llama.cpp), providing a simple API and a command line interface that allows text generation using a GPT-based model like llama directly from the terminal. It is also compatible with [gpt4all](https://github.com/nomic-ai/gpt4all) and [alpaca](https://github.com/tatsu-lab/stanford_alpaca).
|
||||
|
||||
`llama-cli` uses https://github.com/go-skynet/llama, which is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp) providing golang binding.
|
||||
|
||||
## Container images
|
||||
|
||||
To begin, run:
|
||||
|
||||
```
|
||||
docker run -ti --rm quay.io/go-skynet/llama-cli:v0.3 --instruction "What's an alpaca?" --topk 10000
|
||||
docker run -ti --rm quay.io/go-skynet/llama-cli:v0.4 --instruction "What's an alpaca?" --topk 10000 --model ...
|
||||
```
|
||||
|
||||
You will receive a response like the following:
|
||||
@@ -55,7 +57,7 @@ This will generate text based on the given model and instruction.
|
||||
Example of starting the API with `docker`:
|
||||
|
||||
```bash
|
||||
docker run -p 8080:8080 -ti --rm quay.io/go-skynet/llama-cli:v0.3 api --context-size 700 --threads 4
|
||||
docker run -p 8080:8080 -ti --rm quay.io/go-skynet/llama-cli:v0.4 api --context-size 700 --threads 4
|
||||
```
|
||||
|
||||
And you'll see:
|
||||
@@ -88,7 +90,7 @@ The API takes takes the following:
|
||||
| gpt4all | GPT4ALL | false | Set to true for gpt4all models. |
|
||||
|
||||
|
||||
Once the server is running, you can make requests to it using HTTP. For example, to generate text based on an instruction, you can send a POST request to the `/predict` endpoint with the instruction as the request body:
|
||||
Once the server is running, you can start making requests to it using HTTP. For example, to generate text based on an instruction, you can send a POST request to the `/predict` endpoint with the instruction as the request body:
|
||||
|
||||
```
|
||||
curl --location --request POST 'http://localhost:8080/predict' --header 'Content-Type: application/json' --data-raw '{
|
||||
@@ -100,6 +102,8 @@ curl --location --request POST 'http://localhost:8080/predict' --header 'Content
|
||||
}'
|
||||
```
|
||||
|
||||
There is also available a simple web interface (for instance, http://localhost:8080/) which can be used as a playground.
|
||||
|
||||
Note: The API doesn't inject a template for talking to the instance, while the CLI does. You have to use a prompt similar to what's described in the standford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release, for instance:
|
||||
|
||||
```
|
||||
@@ -120,10 +124,10 @@ You can specify a model binary to be used for inference with `--model`.
|
||||
```
|
||||
# Download the model image, extract the model
|
||||
# Use the model with llama-cli
|
||||
docker run -v $PWD:/models -p 8080:8080 -ti --rm quay.io/go-skynet/llama-cli:v0.3-lite api --model /models/model.bin
|
||||
docker run -v $PWD:/models -p 8080:8080 -ti --rm quay.io/go-skynet/llama-cli:v0.4 api --model /models/model.bin
|
||||
```
|
||||
|
||||
gpt4all (https://github.com/nomic-ai/gpt4all) works as well, however the original model needs to be converted:
|
||||
gpt4all (https://github.com/nomic-ai/gpt4all) works as well, however the original model needs to be converted (same applies for old alpaca models, too):
|
||||
|
||||
```bash
|
||||
wget -O tokenizer.model https://huggingface.co/decapoda-research/llama-30b-hf/resolve/main/tokenizer.model
|
||||
@@ -132,6 +136,7 @@ cp gpt4all.. models/
|
||||
git clone https://gist.github.com/eiz/828bddec6162a023114ce19146cb2b82
|
||||
pip install sentencepiece
|
||||
python 828bddec6162a023114ce19146cb2b82/gistfile1.txt models tokenizer.model
|
||||
# There will be a new model with the ".tmp" extension, you have to use that one!
|
||||
```
|
||||
|
||||
### Golang client API
|
||||
@@ -160,6 +165,10 @@ func main() {
|
||||
}
|
||||
```
|
||||
|
||||
### Windows compatibility
|
||||
|
||||
It should work, however you need to make sure you give enough resources to the container. See https://github.com/go-skynet/llama-cli/issues/2
|
||||
|
||||
### Kubernetes
|
||||
|
||||
You can run the API directly in Kubernetes:
|
||||
@@ -189,3 +198,22 @@ docker run --privileged -v /var/run/docker.sock:/var/run/docker.sock --rm -t -v
|
||||
# run the binary
|
||||
./llama-cli --instruction "What's an alpaca?"
|
||||
```
|
||||
|
||||
## Short-term roadmap
|
||||
|
||||
- Mimic OpenAI API (https://github.com/go-skynet/llama-cli/issues/10)
|
||||
- Binary releases (https://github.com/go-skynet/llama-cli/issues/6)
|
||||
- Upstream our golang bindings to llama.cpp (https://github.com/ggerganov/llama.cpp/issues/351)
|
||||
- Multi-model support
|
||||
- Full Deployment and compatibility with https://github.com/mckaywrigley/chatbot-ui
|
||||
|
||||
## License
|
||||
|
||||
MIT
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
|
||||
- https://github.com/tatsu-lab/stanford_alpaca
|
||||
- https://github.com/cornelk/llama-go for the initial ideas
|
||||
- https://github.com/antimatter15/alpaca.cpp for the light model version (this is compatible and tested only with that checkpoint model!)
|
||||
|
||||
Reference in New Issue
Block a user