Sample Integration: vLLM
In this guide, we will show how to add a model served by vLLM, a popular, easy-to-use library for fast LLM inference and serving.
Step 1: Start vLLM Server
To serve a model with vLLM, we can use the vllm serve command and pass it the model to serve:
vllm serve facebook/opt-125m \
  --host 0.0.0.0 \
  --port 5030 \
  --api-key <YOUR_API_KEY> \
  --chat-template "{% for message in messages %}{% if message['role'] == 'system' %}System: {{ message['content'] }}{% elif message['role'] == 'user' %}User: {{ message['content'] }}{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}{% endif %}{% endfor %}" \
  --served-model-name opt-125m
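Before sending chat requests, you can check that the server has come up. A minimal probe, assuming your vLLM version exposes the /health route of the OpenAI-compatible server (it returns HTTP 200 once the server is ready):
curl http://127.0.0.1:5030/health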
Afterwards, the OpenAI-compatible server is listening at http://127.0.0.1:5030/v1/chat/completions, which can be tested with the following request:
curl http://127.0.0.1:5030/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -d '{
    "model": "opt-125m",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
Step 2: Connect the Endpoint
Since vLLM provides an OpenAI-compatible API server, all we need to do next is follow the standard text generation model integration.
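When registering the model, you will typically need the base URL (http://127.0.0.1:5030/v1), the API key passed to vllm serve, and the served model name. A quick way to double-check the model name, assuming the default OpenAI-compatible routes, is to list the models the server exposes; the returned id should match the --served-model-name value from Step 1 (opt-125m in this example):
curl http://127.0.0.1:5030/v1/models \
  -H "Authorization: Bearer <YOUR_API_KEY>"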
Accessing localhost from inside Docker
If you deploy LatticeFlow as a Docker container and your model runs on the same machine, using 127.0.0.1 when registering the model will not work: inside the container, 127.0.0.1 refers to the container itself. Instead, Docker provides a special hostname, host.docker.internal, that resolves to the host machine.
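For example, a model reachable on the host at http://127.0.0.1:5030/v1 would be registered as http://host.docker.internal:5030/v1 instead. This hostname resolves automatically on Docker Desktop (macOS and Windows); on Linux it has to be mapped explicitly when starting the container. A sketch, where <latticeflow-image> is a placeholder for the image you actually run:
# Linux only: map host.docker.internal to the host's gateway address
docker run --add-host=host.docker.internal:host-gateway <latticeflow-image>

# From inside the container, the vLLM server is then reachable via:
curl http://host.docker.internal:5030/v1/models \
  -H "Authorization: Bearer <YOUR_API_KEY>"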