
Run LLM Evals Using Evals API

You can run an evaluation by calling the POST /v1/evaluate endpoint.

Single-Turn Evaluation

The request body must contain the name of a single-turn metric collection and a list of test cases that follow the LLMTestCase data model:

POST - /v1/evaluate
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
  -H "Content-Type: application/json" \
  -H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
  -d '{
    "metricCollection": "Collection Name",
    "testCases": [
      {
        "input": "How tall is Mount Everest?",
        "actualOutput": "I don'\''t know, pretty tall I guess?"
      }
    ]
  }'
response
{ "success": true, "data": {"id": "TEST-RUN-ID"} }

This will create a test run with the corresponding metric data on Confident AI.
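If you want to capture the returned test run ID for later reference, you can parse it out of the response. A minimal sketch, assuming jq is installed and that your project API key is exported in a CONFIDENT_API_KEY environment variable (the variable name is just a convention for this example):

# Run a single-turn evaluation and capture the returned test run ID.
TEST_RUN_ID=$(curl -s -X POST "https://api.confident-ai.com/v1/evaluate" \
  -H "Content-Type: application/json" \
  -H "CONFIDENT_API_KEY: $CONFIDENT_API_KEY" \
  -d '{
    "metricCollection": "Collection Name",
    "testCases": [
      {
        "input": "How tall is Mount Everest?",
        "actualOutput": "About 8,800 meters, give or take."
      }
    ]
  }' | jq -r '.data.id')

echo "Created test run: $TEST_RUN_ID"

The .data.id path mirrors the response shape shown above.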

Multi-Turn Evaluation

The request body must contain the name of a multi-turn metric collection and a list of test cases that follow the ConversationalTestCase data model:

POST - /v1/evaluate
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
  -H "Content-Type: application/json" \
  -H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
  -d '{
    "metricCollection": "Multi-Turn Collection Name",
    "testCases": [
      {
        "turns": [
          {"role": "user", "content": "How tall is Mount Everest?"},
          {"role": "assistant", "content": "Mount Everest is approximately 8,848 meters tall."},
          {"role": "user", "content": "Wow, that'\''s really high! Has that changed recently?"},
          {"role": "assistant", "content": "Yes, a 2020 survey by China and Nepal revised the height to 8,848.86 meters."}
        ]
      }
    ]
  }'
response
{ "success": true, "data": {"id": "TEST-RUN-ID"} }

This will create a test run with the corresponding metric data on Confident AI.
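For longer conversations, escaping a large JSON body inline quickly gets unwieldy. A minimal sketch of a common alternative: write the payload to a file and let curl read it with -d @<path> (the filename payload.json below is arbitrary):

# Write the ConversationalTestCase payload to a file.
cat > payload.json <<'EOF'
{
  "metricCollection": "Multi-Turn Collection Name",
  "testCases": [
    {
      "turns": [
        {"role": "user", "content": "How tall is Mount Everest?"},
        {"role": "assistant", "content": "Mount Everest is approximately 8,848 meters tall."}
      ]
    }
  ]
}
EOF

# Send the request, reading the body from the file.
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
  -H "Content-Type: application/json" \
  -H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
  -d @payload.json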

Advanced Usage

Logging parameters

To pick the best model and prompt based on your evaluation results, supply "hyperparameters": a free-form set of key-value pairs that associates your test run with the configuration of your evaluation pipeline:

POST - /v1/evaluate
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
  -H "Content-Type: application/json" \
  -H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
  -d '{
    "metricCollection": "Collection Name",
    "testCases": [
      {
        "input": "How tall is Mount Everest?",
        "actualOutput": "I don'\''t know, pretty tall I guess?"
      }
    ],
    "hyperparameters": {
      "model": "gpt-4o-mini",
      "prompt version": "ai_generation_v2"
    }
  }'

Logging identifier

You should also name each evaluation run with a custom "identifier" argument. This makes it easier to sort and filter test runs on Confident AI.

POST - /v1/evaluate
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
  -H "Content-Type: application/json" \
  -H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
  -d '{
    "metricCollection": "Collection Name",
    "testCases": [
      {
        "input": "How tall is Mount Everest?",
        "actualOutput": "I don'\''t know, pretty tall I guess?"
      }
    ],
    "identifier": "run-399-102"
  }'
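"identifier" and "hyperparameters" are both top-level fields of the same request body, so they can presumably be combined in a single call. A sketch merging the two examples above, under that assumption:

# Combines the identifier and hyperparameters options in one request
# (each field is documented individually above; combining them is an
# assumption, not confirmed behavior).
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
  -H "Content-Type: application/json" \
  -H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
  -d '{
    "metricCollection": "Collection Name",
    "testCases": [
      {
        "input": "How tall is Mount Everest?",
        "actualOutput": "About 8,800 meters, give or take."
      }
    ],
    "hyperparameters": {
      "model": "gpt-4o-mini",
      "prompt version": "ai_generation_v2"
    },
    "identifier": "run-399-102"
  }'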