Run LLM Evals Using the Evals API
You can run an evaluation by calling the POST /v1/evaluate endpoint.
Single-Turn Evaluation
The request body must contain the name of a single-turn metric collection, and a list of test cases that follow the LLMTestCase data model:
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "I don'\''t know, pretty tall I guess?"
}
]
}'
Response:
{
"success": true,
"data": {"id": "TEST-RUN-ID"}
}
This will create a test run with the corresponding metric data on Confident AI.
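The response contains the ID of the newly created test run, which you may want to capture for later use. A minimal shell sketch, assuming jq is installed on your machine (the RESPONSE and TEST_RUN_ID variable names are illustrative):

RESPONSE=$(curl -s -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "Mount Everest is approximately 8,848 meters tall."
}
]
}')

# Extract the test run ID from the "data.id" field of the response
TEST_RUN_ID=$(echo "$RESPONSE" | jq -r '.data.id')
echo "Created test run: $TEST_RUN_ID"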
Multi-Turn Evaluation
The request body must contain the name of a multi-turn metric collection, and a list of test cases that follow the ConversationalTestCase data model:
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Multi-Turn Collection Name",
"testCases": [
{
"turns": [
{"role": "user", "content": "How tall is Mount Everest?"},
{"role": "assistant", "content": "Mount Everest is approximately 8,848 meters tall."},
{"role": "user", "content": "Wow, that's really high! Has that changed recently?"},
{"role": "assistant", "content": "Yes, a 2020 survey by China and Nepal revised the height to 8,848.86 meters."}
]
}
]
}'
Response:
{
"success": true,
"data": {"id": "TEST-RUN-ID"}
}
This will create a test run with the corresponding metric data on Confident AI.
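For longer conversations, inlining JSON in the shell gets error-prone (note the apostrophe escaping required in the single-turn example above). One way around this, sketched below, is to keep the request body in a file, hypothetically named payload.json here, and pass it with curl's @ syntax:

cat > payload.json <<'EOF'
{
"metricCollection": "Multi-Turn Collection Name",
"testCases": [
{
"turns": [
{"role": "user", "content": "How tall is Mount Everest?"},
{"role": "assistant", "content": "Mount Everest is approximately 8,848 meters tall."}
]
}
]
}
EOF

curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d @payload.json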
Advanced Usage
Logging parameters
To pick the best model and prompt based on your evaluation results, supply a "hyperparameters" object, a free-form map of key-value pairs that associates your test run with the configuration of your evaluation pipeline:
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "I don'\''t know, pretty tall I guess?"
}
],
"hyperparameters": {
"model": "gpt-4o-mini",
"prompt version": "ai_generation_v2"
}
}'
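In a CI pipeline you will usually want the logged hyperparameters to reflect whatever configuration is actually being tested, rather than hardcoded values. A sketch of one way to do this, assuming your pipeline exports MODEL and PROMPT_VERSION environment variables (both names are illustrative):

export MODEL="gpt-4o-mini"
export PROMPT_VERSION="ai_generation_v2"

# Pass the body on stdin (-d @-) so the shell expands the variables
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d @- <<EOF
{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "Mount Everest is approximately 8,848 meters tall."
}
],
"hyperparameters": {
"model": "$MODEL",
"prompt version": "$PROMPT_VERSION"
}
}
EOF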
Logging identifier
You should also name each evaluation run with a custom "identifier" argument, which makes it easier to sort and filter test runs on Confident AI.
curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d '{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "I don'\''t know, pretty tall I guess?"
}
],
"identifier": "run-399-102"
}'
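"hyperparameters" and "identifier" are both top-level fields of the request body, so you can presumably log them together in a single call, though combining them is an assumption based on the examples above rather than something this page states explicitly. A sketch that derives the identifier from a hypothetical $BUILD_ID variable exported by your CI system:

curl -X POST "https://api.confident-ai.com/v1/evaluate" \
-H "Content-Type: application/json" \
-H "CONFIDENT_API_KEY: <PROJECT-API-KEY>" \
-d @- <<EOF
{
"metricCollection": "Collection Name",
"testCases": [
{
"input": "How tall is Mount Everest?",
"actualOutput": "Mount Everest is approximately 8,848 meters tall."
}
],
"hyperparameters": {"model": "gpt-4o-mini"},
"identifier": "run-$BUILD_ID"
}
EOF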