Introduction to LLM Evaluation on Confident AI
Confident AI’s evaluation features are second-to-none and 100% integrated with DeepEval. All the features you’ve seen up to this point in the documentation lead up to the LLM evaluation suite.
What does Confident AI’s LLM Evaluation offer?
We offer evals that can evaluate your LLM app at the:
- End-to-end level
- Component-level
Including support for:
- Single-turn
- Multi-turn, and
- Multimodal (text and images)
And offer 50+ metrics that are:
- Default, battle-tested, open-source, and plug-and-play
- Custom, research-backed, and easy to create in natural language
- For any use case, LLM system architecture, or framework
- Powered by DeepEval
You can run evaluations in CI/CD environments, as a separate Python script, on traces in production, or through APIs.
You can also run evals on an individual component-level, or end-to-end should you wish to treat your LLM app as a black-box.
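For example, to run evals in CI/CD, one common pattern (a minimal sketch, assuming a pytest-style test file named test_llm_app.py and an illustrative metric threshold) is to wrap each test case in assert_test and execute the file with the deepeval CLI:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    # Build a test case from your LLM app's output (placeholder input and output)
    test_case = LLMTestCase(
        input="What is Confident AI?",
        actual_output="Confident AI is the cloud platform for DeepEval.",
    )
    # Fails the CI job if the metric score falls below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

You would then run deepeval test run test_llm_app.py as a step in your pipeline.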
Get Started
Add Confident AI Evals to your LLM app.
Quickstart
Get started with LLM evaluation on Confident AI by following this 5 min guide.
Create Metric
Import metrics from DeepEval. In this example we’re using GEval
to create a custom answer relevancy metric:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

metric = GEval(
    name="Relevancy",
    criteria="Determine how relevant the 'actual output' is to the 'input'",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
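If you’d like to sanity-check this metric before running a full evaluation, you can score a single test case with it locally (the input and output below are placeholders):

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What's the weather like today?",
    actual_output="I can't check live weather, but I can explain how forecasts work.",
)

# Runs the LLM-as-a-judge evaluation for this one test case
metric.measure(test_case)
print(metric.score, metric.reason)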
How to customize your LLM judge?
99.99% of metrics on DeepEval use LLM-as-a-judge, which means you’ll have to set your evaluation model. For most users this will be OpenAI, and you’ll need to set your OpenAI API key:
export OPENAI_API_KEY="sk-..."
You can also use other model providers that deepeval has integrations with.
Lastly, you can wrap your own LLM API in deepeval’s DeepEvalBaseLLM class to use ANY model of your choice. Click here to learn how.
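As a rough sketch of what a custom judge looks like (MyCustomModel and its client are hypothetical, not a specific integration), you implement a few methods on DeepEvalBaseLLM:

from deepeval.models import DeepEvalBaseLLM

class MyCustomModel(DeepEvalBaseLLM):
    def __init__(self, client):
        # `client` is whatever SDK or HTTP wrapper your model exposes
        self.client = client

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        # Call your own LLM API and return the raw text response
        return self.client.complete(prompt)

    async def a_generate(self, prompt: str) -> str:
        # Async variant; falling back to the sync call works too
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "my-custom-model"

You can then pass an instance to any metric that accepts a model argument, for example AnswerRelevancyMetric(model=MyCustomModel(client)).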
This will run evaluations locally before sending results over to Confident AI. You can also run evals on the cloud by creating a metric collection.
Prepare for Evaluation
Decorate your LLM app (replace with your own):
from openai import OpenAI
from deepeval.tracing import observe

client = OpenAI()

@observe()
def your_llm_app(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
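If you want component-level evals instead of treating your app as a black box, you can decorate the inner components too; here’s a minimal sketch (retrieve and generate are hypothetical stand-ins for your own components):

from deepeval.tracing import observe

@observe()
def retrieve(query: str) -> list[str]:
    # Your own retrieval logic goes here
    return ["some retrieved context"]

@observe()
def generate(query: str, context: list[str]) -> str:
    # Your own generation logic goes here
    return f"Answer to '{query}' using {len(context)} context chunk(s)"

@observe()
def your_llm_app(query: str) -> str:
    return generate(query, retrieve(query))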
Pull the dataset you’ve created (full guide here):
from deepeval.dataset import EvaluationDataset
# Pull your dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")
Run an LLM Eval
Putting everything together, run your first LLM evaluation:
from openai import OpenAI
from deepeval.tracing import observe
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

client = OpenAI()

@observe()
def your_llm_app(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

# Process each golden in your dataset
for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=your_llm_app(golden.input))
    dataset.add_test_case(test_case)

# Run an evaluation
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])
Congratulations 🎉! Your test run is now available on Confident AI automatically as a testing report.
Identify Failing Test Case(s)
Identify your failing test cases in the testing report on Confident AI.
Future Roadmap
- Editor table columns for custom metrics
- Better hyperparameters display
- Insights page
- Full text-search on test cases