API Specification

To integrate an AI system (and its underlying models) into LatticeFlow, the AI system must be accessible via API endpoints.

API Endpoints

The following API endpoints are used by the analysis. The endpoints are grouped into three categories:

  • Chat completion, which, given an input text, returns the model response.
  • Embeddings, which returns a vector representation of a given input text.
  • Content, which exposes operations over the underlying RAG knowledge base (if available).

📘 Note: If not all endpoints are available, some parts of the analysis will be disabled.

For a smooth integration, each endpoint should be described with the following information:

  • The HTTP request method and the URL (e.g. GET https://api.mymodel.com/v1/chat/completions).
  • The authentication method (e.g., HTTP header Authorization: Bearer $API_KEY).
  • The request body format definition.
  • The response format definition.

By default, we expect the API to follow the OpenAI API format. When using other formats, please consult your LatticeFlow representative.

Chat

Completion

Creates a model response for the given chat conversation. By default, we expect the API to follow the OpenAI API format.

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4o-mini",
  "system_fingerprint": "fp_44709d6fcb",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "\n\nHello there, how may I assist you today?",
    },
    "logprobs": null,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  }
}
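
For reference, assuming the response above is saved to a file (response.json is a hypothetical name), the assistant reply can be extracted with jq:

jq -r '.choices[0].message.content' response.json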

Note that not all the parameters need to be supported. The parameters used by LatticeFlow include the following (an illustrative request appears after the list):

  • max_completion_tokens: An upper bound for the number of tokens that can be generated for a completion.
  • temperature: The sampling temperature, between 0 and 1.
  • n: How many chat completion choices to generate for each input message.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
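
For illustration, a request exercising these parameters might look as follows; the values shown are arbitrary placeholders:

# Illustrative values only; in practice, temperature and top_p are
# typically not varied together.
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "max_completion_tokens": 128,
    "temperature": 0.7,
    "n": 2,
    "top_p": 0.9
  }'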

Logits

Creates a model response for the given chat conversation, but additionally includes the log probabilities of the most likely output tokens, as well as the chosen tokens. This can be implemented as a parameter flag in the chat completion API or as a separate API. By default, we expect the API to follow the OpenAI API format.

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "How much is 1+1?"
      }
    ],
    "logprobs": true,
    "top_logprobs": 5
  }'
{
   "id":"chatcmpl-B5955qDqikGjp8hK8YfCAwirdJoAu",
   "object":"chat.completion",
   "created":1740566247,
   "model":"gpt-4o-mini-2024-07-18",
   "choices":[
      {
         "index":0,
         "message":{
            "role":"assistant",
            "content":"2",
            "refusal":"None"
         },
         "logprobs":{
            "content":[
               {
                  "token":"2",
                  "logprob":-1.9361264946837764e-07,
                  "bytes":[
                     50
                  ],
                  "top_logprobs":[
                     {
                        "token":"2",
                        "logprob":-1.9361264946837764e-07,
                        "bytes":[
                           50
                        ]
                     },
                     {
                        "token":"Two",
                        "logprob":-16.375,
                        "bytes":[
                           84,
                           119,
                           111
                        ]
                     },
                     {
                        "token":"₂",
                        "logprob":-21.125,
                        "bytes":[
                           226,
                           130,
                           130
                        ]
                     },
                     {
                        "token":"2",
                        "logprob":-21.875,
                        "bytes":[
                           239,
                           188,
                           146
                        ]
                     },
                     {
                        "token":"1",
                        "logprob":-22.375,
                        "bytes":[
                           49
                        ]
                     }
                  ]
               }
            ],
            "refusal":"None"
         },
         "finish_reason":"sto"
      }
   ],
   "usage":{
      "prompt_tokens":34,
      "completion_tokens":2,
      "total_tokens":36,
      "prompt_tokens_details":{
         "cached_tokens":0,
         "audio_tokens":0
      },
      "completion_tokens_details":{
         "reasoning_tokens":0,
         "audio_tokens":0,
         "accepted_prediction_tokens":0,
         "rejected_prediction_tokens":0
      }
   },
   "service_tier":"default",
   "system_fingerprint":"fp_06737a9306"
}

The parameters used to control the log probability generation are listed below (a sketch of how to read the returned values follows the list):

  • logprobs: Whether to return log probabilities of the output tokens or not.
  • top_logprobs: An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
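
As a sketch of how these values can be consumed, assuming the response above is saved to a file (response.json is a hypothetical name), jq can list each candidate token with its probability; jq's exp converts a log probability back into a probability:

jq -r '.choices[0].logprobs.content[].top_logprobs[] | "\(.token)\t\(.logprob | exp)"' response.json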

Embeddings

Creates an embedding vector representing the input text. By default, we expect the API to follow the OpenAI API format.

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-ada-002",
    "encoding_format": "float"
  }'
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        .... (1536 floats total for ada-002)
        -0.0028842222
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
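
As a quick sanity check of the integration, the dimensionality of the returned vector can be verified against the model's documented embedding size (1536 for text-embedding-ada-002), e.g. with jq over a hypothetical saved response.json:

jq '.data[0].embedding | length' response.json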

Content

When assessing RAG-based systems, additional API endpoints (or extensions of existing endpoints) are required for the analysis. Unfortunately, no common API definition currently exists, and each provider typically has its own specific format and capabilities.

As a result, every provider currently requires a custom integration. To facilitate this, please provide LatticeFlow with a specification of the supported capabilities and the API format.

The common APIs that are required by the analysis include:

  • For a given chat completion, additionally return which documents were retrieved and used as part of the context (see the hypothetical sketch after this list).
  • For a given chat completion, specify which knowledge base (or subset of documents) should be considered for retrieval. This implies that the system supports multiple knowledge bases and that a query can select which one to use.
  • Listing, creating, and deleting documents in the knowledge base, optionally selecting which knowledge base to use.
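
To make the first two requirements concrete, a provider-specific integration could, for example, extend the chat completion request and response as sketched below. All field names (knowledge_base_id, retrieved_documents, document_id, score) are purely hypothetical; the actual format is agreed upon as part of the custom integration.

# Hypothetical sketch only: knowledge_base_id and retrieved_documents are
# illustrative names, not an existing API.
curl https://api.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "my-rag-model",
    "knowledge_base_id": "kb-main",
    "messages": [
      {
        "role": "user",
        "content": "What is the refund policy?"
      }
    ]
  }'
{
  "id": "chatcmpl-456",
  "object": "chat.completion",
  "model": "my-rag-model",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Refunds are accepted within 30 days of purchase."
    },
    "finish_reason": "stop"
  }],
  "retrieved_documents": [
    {
      "document_id": "doc-42",
      "knowledge_base_id": "kb-main",
      "score": 0.87
    }
  ]
}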