Sample Integration: vLLM
In this guide, we show how to add a model served by vLLM, a popular and easy-to-use library for fast LLM inference and serving.
Step 1: Start vLLM Server
As vLLM provides an OpenAI-compatible API server out of the box, all we need to do is select the model we would like to serve and start the server:
vllm serve facebook/opt-125m \
--host 0.0.0.0 \
--port 5030 \
--api-key <YOUR_API_KEY> \
--chat-template "{% for message in messages %}{% if message['role'] == 'system' %}System: {{ message['content'] }}{% elif message['role'] == 'user' %}User: {{ message['content'] }}{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}{% endif %}{% endfor %}" \
--served-model-name opt-125m
Afterwards, the OpenAI-compatible server is listening at http://127.0.0.1:5030/v1/chat/completions, which can be tested with the following request:
curl http://127.0.0.1:5030/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR_API_KEY>" \
-d '{
"model": "opt-125m",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
Step 2: Connect the Endpoint
To connect the vLLM server, the following API call is used:
curl --request POST \
--url http://127.0.0.1:5005/api/model-providers/model_endpoints/models \
--header "X-LatticeFlow-API-Key: $LF_API_KEY" \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '
{
"modality": "text",
"task": "chat_completion",
"key": "opt-125m",
"api_key": "<YOUR_API_KEY>",
"model_adapter_key": "openai",
"url": "http://host.docker.internal/v1",
"rate_limit": 600,
"name": "opt-125m"
}
'
For full documentation, please consult the create model API endpoint specification.
Accessing localhost from inside Docker
If you deploy LatticeFlow as a Docker container and your model is deployed on the same machine, using 127.0.0.1 when registering the model will not work. Instead, Docker provides a special hostname, host.docker.internal, that resolves to the host machine.
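On Linux, host.docker.internal is not always available by default. As a sketch, assuming you start the LatticeFlow container yourself with docker run (the image name below is a placeholder), you can map the hostname to the host gateway explicitly:
# Map host.docker.internal to the host's gateway IP (placeholder image name).
docker run --add-host=host.docker.internal:host-gateway <latticeflow-image>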
Step 3: Test the Integration
Testing existing integrations is supported using the test inference API endpoint.
Example
To test the integration above, the following call is used:
curl --request GET \
--url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m/test \
--header "X-LatticeFlow-API-Key: $LF_API_KEY" \
--header 'accept: application/json'
If everything works fine, you should see a response like:
{"choices": [{"message": {"role": "assistant", "content": "Hello! How can I help you today?"}}]}
Optional: Add SSL Certificate
You can add your own certificate to the vLLM server by passing the --ssl-certfile cert.pem --ssl-keyfile key.pem command-line parameters to vllm serve. To pass the same certificate to your LatticeFlow deployment as well, please follow the steps in the Advanced: SSL Certificate guide.
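For reference, here is a sketch of the Step 1 command with the SSL flags added (cert.pem and key.pem are placeholder paths):
# Serve the model over HTTPS using your own certificate and key.
vllm serve facebook/opt-125m \
--host 0.0.0.0 \
--port 5030 \
--api-key <YOUR_API_KEY> \
--served-model-name opt-125m \
--ssl-certfile cert.pem \
--ssl-keyfile key.pem
Note that the model endpoint URL registered in Step 2 should then use https:// instead of http://.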
Optional: Update the Rate Limit
We follow the GET, modify, PUT pattern as follows:
- GET the current configuration.
curl --request GET \
--url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
--header "X-LatticeFlow-API-Key: $LF_API_KEY" \
--header 'accept: application/json'
- Modify the received JSON configuration by adjusting the rate limit to the desired value.
- PUT the updated configuration.
curl --request PUT \
--url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
--header "X-LatticeFlow-API-Key: $LF_API_KEY" \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '
{
"modality": "text",
"task": "chat_completion",
"key": "opt-125m",
"api_key": "<YOUR_API_KEY>",
"model_adapter_key": "openai",
"url": "http://host.docker.internal/v1",
"rate_limit": 1200,
"name": "opt-125m"
}
'
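The three steps can also be combined into a single shell pipeline. This is a sketch assuming jq is installed and that the GET response body matches the PUT payload schema shown above:
# GET the current configuration, set the new rate limit, and PUT it back.
curl --silent --request GET \
--url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
--header "X-LatticeFlow-API-Key: $LF_API_KEY" \
--header 'accept: application/json' \
| jq '.rate_limit = 1200' \
| curl --request PUT \
--url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
--header "X-LatticeFlow-API-Key: $LF_API_KEY" \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data @-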