Sample Integration: vLLM

In this guide, we show how to add a model served by vLLM, a popular, easy-to-use library for fast LLM inference and serving.

Step 1: Start vLLM Server

As vLLM provides an OpenAI-compatible API server out of the box, all we need to do is select the model we would like to serve and start the server:

vllm serve facebook/opt-125m \
  --host 0.0.0.0 \
  --port 5030 \
  --api-key <YOUR_API_KEY> \
  --chat-template "{% for message in messages %}{% if message['role'] == 'system' %}System: {{ message['content'] }}{% elif message['role'] == 'user' %}User: {{ message['content'] }}{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}{% endif %}{% endfor %}" \
  --served-model-name opt-125m

Afterwards, the OpenAI-compatible server is listening at http://127.0.0.1:5030/v1/chat/completions, which can be tested with the following request:

curl http://127.0.0.1:5030/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -d '{
    "model": "opt-125m",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
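
You can also confirm which models the server exposes by querying the /v1/models route, which is part of the OpenAI-compatible API that vLLM serves:

curl http://127.0.0.1:5030/v1/models \
  -H "Authorization: Bearer <YOUR_API_KEY>"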

Step 2: Connect the Endpoint

To connect the vLLM server, use the following API call:

curl --request POST \
     --url http://127.0.0.1:5005/api/model-providers/model_endpoints/models \
     --header "X-LatticeFlow-API-Key: $LF_API_KEY" \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
  "modality": "text",
  "task": "chat_completion",
  "key": "opt-125m",
  "api_key": "<YOUR_API_KEY>",
  "model_adapter_key": "openai",
  "url": "http://host.docker.internal/v1",
  "rate_limit": 600,
  "name": "opt-125m"
}
'

For full documentation, please consult the create model API endpoint specification.

📘 Accessing localhost from inside Docker

If you deploy LatticeFlow as a Docker container and your model is deployed on the same machine, using 127.0.0.1 when registering the model will not work. Instead, Docker provides a special hostname host.docker.internal that resolves to the host machine.
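
On Docker Desktop (macOS and Windows) this hostname is available out of the box. On Linux, depending on your Docker version, you may need to map it explicitly when starting the LatticeFlow container. A minimal sketch (requires Docker 20.10+; <latticeflow-image> is a placeholder and the other deployment flags are omitted):

# Linux only: make host.docker.internal resolve to the Docker host
docker run --add-host=host.docker.internal:host-gateway <latticeflow-image>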

Step 3: Test the Integration

Testing existing integrations is supported using the test inference API endpoint.

Example

To test the integration above, the following call is used:

curl --request GET \
     --url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m/test \
     --header "X-LatticeFlow-API-Key: $LF_API_KEY" \
     --header 'accept: application/json'

If everything works fine, you should see a response like:

{"choices": [{"message": {"role": "assistant", "content": "Hello! How can I help you today?"}}]}

Optional: Add SSL Certificate

You can add your own certificate to the vLLM server by passing the --ssl-certfile cert.pem --ssl-keyfile key.pem command-line parameters to vllm serve. To pass the same certificate to your LatticeFlow deployment as well, please follow the steps in the Advanced: SSL Certificate guide.
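
For example, extending the vLLM command from Step 1 with these flags (a sketch; the chat template is omitted for brevity, and cert.pem / key.pem stand in for your certificate and private key):

vllm serve facebook/opt-125m \
  --host 0.0.0.0 \
  --port 5030 \
  --api-key <YOUR_API_KEY> \
  --served-model-name opt-125m \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem

With TLS enabled, the URL registered in Step 2 should use https:// instead of http://.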

Optional: Update the Rate Limit

To update the rate limit, we follow the GET, modify, PUT pattern (a scripted variant using jq is sketched after the steps below):

  1. GET the current configuration.
curl --request GET \
     --url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
     --header "X-LatticeFlow-API-Key: $LF_API_KEY" \
     --header 'accept: application/json'
  2. Modify the received JSON configuration by adjusting the rate limit to the desired value.
  3. PUT the updated configuration.
curl --request PUT \
     --url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
     --header "X-LatticeFlow-API-Key: $LF_API_KEY" \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
  "modality": "text",
  "task": "chat_completion",
  "key": "opt-125m",
  "api_key": "<YOUR_API_KEY>",
  "model_adapter_key": "openai",
  "url": "http://host.docker.internal/v1",
  "rate_limit": 1200,
  "name": "opt-125m"
}
'
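
If jq is available, the same GET, modify, PUT round trip can be scripted. This is a sketch; it assumes the GET response uses the same JSON shape as the payload above, including a top-level rate_limit field:

# 1. GET the current configuration and set the new rate limit with jq
curl -s --request GET \
     --url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
     --header "X-LatticeFlow-API-Key: $LF_API_KEY" \
     --header 'accept: application/json' \
  | jq '.rate_limit = 1200' > opt-125m.json

# 2. PUT the updated configuration back
curl --request PUT \
     --url http://127.0.0.1:5005/api/model-providers/model_endpoints/models/opt-125m \
     --header "X-LatticeFlow-API-Key: $LF_API_KEY" \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data @opt-125m.json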