Run LLM Evals Using the Evals API
You can run an evaluation by calling the POST /v1/evaluate endpoint.
Single-Turn Evaluation
The request body must contain the name of a single-turn metric collection, and a list of test cases that follow the LLMTestCase data model:
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "I don'\''t know, pretty tall I guess?"
}
]
}'
Response:
{
"success": true,
"data": {"id": "TEST-RUN-ID"}
}
This will create a test run with the corresponding metric data on Confident AI.
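The response contains the ID of the newly created test run, which you may want to capture for later use. A minimal shell sketch, assuming jq is installed on your machine (the RESPONSE and TEST_RUN_ID variable names are illustrative):

RESPONSE=$(curl -s -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "Mount Everest is approximately 8,848 meters tall."
}
]
}')

# Extract the test run ID from the "data.id" field of the response
TEST_RUN_ID=$(echo "$RESPONSE" | jq -r '.data.id')
echo "Created test run: $TEST_RUN_ID"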
Multi-Turn Evaluation
The request body must contain the name of a multi-turn metric collection, and a list of test cases that follow the ConversationalTestCase data model:
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Multi-Turn Collection Name",
"testCases": [
{
"turns": [
{"role": "user", "content": "How tall is Mount Everest?"},
{"role": "assistant", "content": "Mount Everest is approximately 8,848 meters tall."},
{"role": "user", "content": "Wow, that's really high! Has that changed recently?"},
{"role": "assistant", "content": "Yes, a 2020 survey by China and Nepal revised the height to 8,848.86 meters."}
]
}
]
}'
Response:
{
"success": true,
"data": {"id": "TEST-RUN-ID"}
}
This will create a test run with the corresponding metric data on Confident AI.
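For longer conversations, inlining JSON in the shell gets error-prone (note the apostrophe escaping required in the single-turn example above). One way around this, sketched below, is to keep the request body in a file, hypothetically named payload.json here, and pass it with curl's @ syntax:

cat > payload.json <<'EOF'
{
"metricCollection": "Multi-Turn Collection Name",
"testCases": [
{
"turns": [
{"role": "user", "content": "How tall is Mount Everest?"},
{"role": "assistant", "content": "Mount Everest is approximately 8,848 meters tall."}
]
}
]
}
EOF

curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d @payload.json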
Advanced Usage
Logging parameters
To pick the best model and prompt based on your evaluation results, supply a "hyperparameters" object, a free-form map of key-value pairs that associates your test run with the configuration of your evaluation pipeline:
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "I don'\''t know, pretty tall I guess?"
}
],
"hyperparameters": {
"model": "gpt-4o-mini",
"prompt version": "ai_generation_v2"
}
}'
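In a CI pipeline you will usually want the logged hyperparameters to reflect whatever configuration is actually being tested, rather than hardcoded values. A sketch of one way to do this, assuming your pipeline exports MODEL and PROMPT_VERSION environment variables (both names are illustrative):

export MODEL="gpt-4o-mini"
export PROMPT_VERSION="ai_generation_v2"

# Pass the body on stdin (-d @-) so the shell expands the variables
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d @- <<EOF
{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "Mount Everest is approximately 8,848 meters tall."
}
],
"hyperparameters": {
"model": "$MODEL",
"prompt version": "$PROMPT_VERSION"
}
}
EOF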
Logging identifier
You should also name each evaluation run with a custom "identifier" argument, which makes it easier to sort and filter test runs on Confident AI.
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "I don'\''t know, pretty tall I guess?"
}
],
"identifier": "run-399-102"
}'
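"hyperparameters" and "identifier" are both top-level fields of the request body, so you can presumably log them together in a single call, though combining them is an assumption based on the examples above rather than something this page states explicitly. A sketch that derives the identifier from a hypothetical $BUILD_ID variable exported by your CI system:

curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d @- <<EOF
{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "Mount Everest is approximately 8,848 meters tall."
}
],
"hyperparameters": {"model": "gpt-4o-mini"},
"identifier": "run-$BUILD_ID"
}
EOF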