Sample Integration: vLLM

In this guide, we will show how to add a model served by vLLM, a popular, easy-to-use library for fast LLM inference and serving.

Step 1: Start vLLM Server

To serve a model with vLLM, we can use the vllm serve command, passing it the model to serve:

vllm serve facebook/opt-125m \
  --host 0.0.0.0 \
  --port 5030 \
  --api-key <YOUR_API_KEY> \
  --chat-template "{% for message in messages %}{% if message['role'] == 'system' %}System: {{ message['content'] }}{% elif message['role'] == 'user' %}User: {{ message['content'] }}{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}{% endif %}{% endfor %}" \
  --served-model-name opt-125m

Afterwards, the OpenAI-compatible server is listening at http://127.0.0.1:5030/v1/chat/completions, which can be tested with the following request:

curl http://127.0.0.1:5030/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -d '{
    "model": "opt-125m",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
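
Since the endpoint follows the OpenAI API, it can also be exercised with the official openai Python client. The sketch below is illustrative: it assumes the openai package is installed and reuses the port, API key, and served model name from the vllm serve command above.

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(
    base_url="http://127.0.0.1:5030/v1",
    api_key="<YOUR_API_KEY>",  # the key passed to --api-key above
)

response = client.chat.completions.create(
    model="opt-125m",  # the name passed to --served-model-name above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)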

Step 2: Connect the Endpoint

As vLLM provides an OpenAI-compatible API server, all we need to do next is follow the standard text generation model integration.

📘

Accessing localhost from inside Docker

If you deploy LatticeFlow as a Docker container and your model is deployed on the same machine, using 127.0.0.1 when registering the model will not work, because inside the container it resolves to the container itself. Instead, use the special hostname host.docker.internal, which Docker resolves to the host machine.
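
For example, a process running inside a container on the same machine would reach the vLLM server from Step 1 at a base URL like the one in this sketch (same illustrative openai client as above):

from openai import OpenAI

# Inside the container, host.docker.internal points to the Docker host,
# where the vLLM server from Step 1 is listening on port 5030.
client = OpenAI(
    base_url="http://host.docker.internal:5030/v1",
    api_key="<YOUR_API_KEY>",
)
print(client.models.list())  # quick connectivity check against /v1/models

Note that on Linux, host.docker.internal may need to be enabled explicitly, for example by starting the container with --add-host=host.docker.internal:host-gateway.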