Run Component-Level Evals
You can run evaluations at the component level by creating test cases at evaluation time. This requires you to set up tracing, which also brings additional benefits such as component-level debugging and visualization of latencies, model costs, etc. in testing reports on the UI.
Setting up tracing also automatically grants you access to all of Confident AI’s observability features for production monitoring.
Running component-level evals enables you to:
- Generate organization-wide sharable testing reports
- A/B experimentation for regression testing
- Hyperparameter experimentation
- Data-driven decision making
If you would rather treat your LLM app as a black box, you can run end-to-end evaluations instead.
Code Summary
Evals In CI/CD
import pytest
from openai import OpenAI
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span
from deepeval import assert_test

client = OpenAI()

@observe(metrics=[AnswerRelevancyMetric()])
def llm_app(query: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": query}
        ]
    ).choices[0].message.content
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res

dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    assert_test(golden=golden, observed_callback=llm_app)
Execute the file using DeepEval’s pytest wrapper:
deepeval test run test_llm_app.py
Define Metrics
You can define your metrics by importing them from DeepEval.
from deepeval.metrics import AnswerRelevancyMetric
metric = AnswerRelevancyMetric()
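Most DeepEval metrics also accept optional constructor arguments; the ones shown below (threshold, model, include_reason) are common options, but double-check them against the version of DeepEval you have installed:
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    threshold=0.7,        # minimum score required to pass
    model="gpt-4o",       # LLM used as the evaluation judge
    include_reason=True,  # attach a reason to each score
)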
Setup Tracing and Create Test Cases
Set up tracing for your LLM application with the @observe decorator, which allows you to create test cases at runtime for each individual component.
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from openai import OpenAI
...

client = OpenAI()
# Decorate your LLM app and provide metrics
@observe(metrics=[metric])
def llm_app(query: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": query}
        ]
    ).choices[0].message.content
    # This creates a test case for your component at evaluation time
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res
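Component-level evals shine when your app has multiple steps, since each @observe-decorated function becomes its own span with its own metrics. As a sketch, here is a hypothetical retrieve_documents component (its name, metric choice, and placeholder retrieval logic are illustrative assumptions, not part of the snippet above):
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

# Hypothetical retriever component with its own metric and test case
@observe(metrics=[ContextualRelevancyMetric()])
def retrieve_documents(query: str) -> list:
    docs = ["..."]  # replace with your actual retrieval logic
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output="\n".join(docs),
            retrieval_context=docs,
        )
    )
    return docs

# Calling retrieve_documents() inside the @observe-decorated llm_app
# above makes it appear as a nested component in the same trace.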
Tracing also lets you visualize and debug the latency, model cost, etc. of each individual component in your testing report. You can read more about all of Confident AI’s tracing features and capabilities in the tracing section.
Run an Evaluation
At evaluation time, DeepEval will pass the input of each individual golden in your dataset to invoke your LLM app. Be patient while waiting for evaluation results, as the evaluation duration is often bound by the execution time of your LLM app.
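If you are running the evaluation locally rather than in CI/CD, recent DeepEval versions also expose an evaluate function that accepts your goldens together with the observed callback; treat the exact signature below as an assumption and verify it against your installed version:
from deepeval.dataset import EvaluationDataset
from deepeval import evaluate

dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

# Invokes llm_app once per golden, then evaluates every traced component
evaluate(goldens=dataset.goldens, observed_callback=llm_app)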
In CI/CD
Unit-test your LLM app in CI/CD using DeepEval’s pytest integration via the assert_test function:
import pytest
from deepeval.dataset import EvaluationDataset, Golden
from deepeval import assert_test
...
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")
# This loops through your goldens
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    assert_test(golden=golden, observed_callback=llm_app)
Execute deepeval test run in the CLI to test it out:
deepeval test run test_llm_app.py
Congratulations 🎉! Your test run should now be available on Confident AI as a testing report ✅. Click around the testing report and take your time to get familiarized with it.
NOTE: Don’t forget to add this command to your .yaml files to automate it in CI/CD pipelines such as GitHub Actions!
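Since testing reports also support hyperparameter experimentation (see the list above), you can log the hyperparameters used for each test run; the decorator below reflects the pattern in recent DeepEval versions, so confirm the exact arguments against your installed release:
import deepeval

# Logged hyperparameters appear on the testing report, letting you
# compare test runs across models and prompt templates
@deepeval.log_hyperparameters(model="gpt-4o", prompt_template="You are a helpful assistant.")
def hyperparameters():
    return {"temperature": 1, "top_k": 10}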
Setup Notifications (recommended)
You can also set up your project to receive notifications through email, Slack, Discord, or Teams each time an evaluation completes, whether run locally or on the cloud, by configuring your integrations in Project Settings > Integrations.
To learn how, visit the project integrations page.