Run End-to-End Evals
You can run end-to-end evaluations on the overall inputs and outputs of your LLM system by treating your LLM app as a black box.
You don’t need to set up tracing for end-to-end evals, but you should still do so since it gives you maximum visibility into your LLM app.
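Throughout this guide, your_llm_app() refers to whatever function wraps your LLM system end-to-end — it is a hypothetical placeholder, not part of deepeval. A minimal sketch, assuming a simple OpenAI-backed app, might look like this:

from openai import OpenAI

client = OpenAI()

def your_llm_app(input: str) -> str:
    # Black box: takes the end user's input, returns the final output.
    # Internally this could involve retrieval, tool calls, agents, etc.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content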
Similar to component-level evals, end-to-end evaluations generate testing reports that enable:
- Organization-wide, sharable testing reports
- A|B experimentation for regression testing
- Hyperparameter experimentation
- Data-driven decision making
If you wish to run evaluations on individual components in your LLM app, you should run component-level evaluations instead.
Code Summary
Evals In CI/CD
import pytest
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")  # Pull dataset from Confident AI

@pytest.mark.parametrize("golden", dataset.goldens)  # Loop through goldens
def test_llm_app(golden: Golden):
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input)  # Replace with your LLM app
    )
    # Replace with your metrics
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])
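To run the summary above, save it as test_llm_app.py and execute deepeval test run test_llm_app.py — the same command covered in the CI/CD section below.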
In the previous section we saw how to define metrics in deepeval, and running evaluations is as simple as using those metrics on test cases created from your dataset's goldens.
Define Metrics
Create a metric locally via DeepEval (more info here if unsure):
from deepeval.metrics import AnswerRelevancyMetric
metric = AnswerRelevancyMetric()
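Most deepeval metrics also accept optional arguments such as a passing threshold, the judge model, and whether to include a reason with each score. A minimal sketch — the exact values below are illustrative, not required:

from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    threshold=0.7,        # Minimum score for a test case to pass
    model="gpt-4o",       # LLM used as the judge
    include_reason=True   # Attach an explanation to each score
)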
Pull a Dataset
Once you have defined your metric(s), it’s time to pull your dataset to create some test cases for evaluation:
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
...
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")
# Convert goldens to test cases
for golden in dataset.goldens:
    input = golden.input
    # Replace your_llm_app() with your actual LLM app
    test_case = LLMTestCase(input=input, actual_output=your_llm_app(golden.input))
    dataset.test_cases.append(test_case)
print(dataset.test_cases) # Sanity check yourself
If you’re unsure what this does, read about using datasets on Confident AI here.
Run an Evaluation
Similar to component-level evals, DeepEval will pass in the inputs of each individual golden in your dataset to invoke your LLM app.
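If you're iterating locally rather than in CI/CD, deepeval also exposes an evaluate() function that runs the same metrics over your test cases in a plain Python script. A minimal sketch, assuming the metric and dataset built in the steps above:

from deepeval import evaluate
...

# Run your metric(s) against the test cases built from your goldens
evaluate(test_cases=dataset.test_cases, metrics=[metric])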
In CI/CD
Unit-test your LLM app in CI/CD using DeepEval’s pytest integration via the assert_test function:
import pytest
from deepeval.dataset import Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)  # Loop through goldens
def test_llm_app(golden: Golden):
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input)  # Replace with your LLM app
    )
    assert_test(test_case=test_case, metrics=[metric])  # Replace with your metrics
deepeval test run test_llm_app.py
Congratulations 🎉! Your test run is now available on Confident AI automatically as a testing report.
By default, test runs on Confident AI are assigned randomly generated IDs, which can make it difficult to identify specific runs for A|B regression testing later. You can customize this by providing your test run with a custom identifier:
deepeval test run test_llm_app.py -i "My First Test Run"
Run Evals on the Cloud
If you’re using a language other than Python (e.g. TypeScript), or wish to enable non-technical team members to run evals without touching a single line of code, you can configure Confident AI to run evaluations on the cloud instead.
You should always try to get your metrics working as expected before moving on to this section, since it is much harder to customize metrics on the cloud.
Trigger evals in TypeScript or via HTTPS
To trigger an evaluation on the cloud via HTTPS, simply use deepeval-ts or curl to send test cases to Confident AI for evaluation.
TypeScript
import { evaluate, EvaluationDataset, LLMTestCase } from 'deepeval-ts';

const dataset = new EvaluationDataset();
await dataset.pull({ alias: "your-dataset-alias" });

// Convert goldens to test cases
for (const golden of dataset.goldens) {
  const testCase = new LLMTestCase({
    input: golden.input,
    actualOutput: await yourLlmApp(golden.input)  // Replace with your LLM app
  });
  dataset.addTestCase(testCase);
}

// Run evals on Confident AI
await evaluate({
  metricCollection: "your-collection-name",
  testCases: dataset.testCases
});
Not all parameters in testCases are required. You should figure out which set of parameters your metric collection needs by reading what’s required for each metric on the Metrics > Glossary page.
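The same applies whether you trigger evals from TypeScript or Python. As a hypothetical Python sketch, a golden-to-test-case conversion that also supplies expected_output and retrieval_context (swap in whatever fields your metric collection actually requires) could look like:

from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(
    input=golden.input,
    actual_output=your_llm_app(golden.input),   # Replace with your LLM app
    expected_output=golden.expected_output,     # Used by reference-based metrics
    retrieval_context=retrieved_chunks          # Used by RAG metrics (hypothetical variable)
)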
If you’re using Python and, for some reason, still decide to run evaluations on Confident AI, here’s how you can do it:
from deepeval.prompt import Prompt
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval import confident_evaluate

# Initialize and pull your dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

# Use your actual prompt alias from Confident AI
prompt = Prompt(alias="your-prompt-alias")
prompt.pull()

# Process each golden in your dataset
for golden in dataset.goldens:
    input = golden.input
    # Replace your_llm_app() with your actual LLM application
    test_case = LLMTestCase(input=input, actual_output=your_llm_app(input, prompt))
    dataset.test_cases.append(test_case)

# Run evals on Confident AI
confident_evaluate(metric_collection="your-collection-name", test_cases=dataset.test_cases)
Trigger evals without code
- Set up an LLM connection endpoint
- Go to the Evaluations page
- Press the Evaluate button
- Select your metric collection and dataset
- Press Evaluate to run the evaluation!
You should only be doing this if you have non-technical team members that wish to trigger evaluations on the platform without going through code.
Set Up Notifications (recommended)
You can also set up your project to receive notifications through email, Slack, Discord, or Teams each time an evaluation completes, whether it was run locally or on the cloud, by configuring your project integrations in Project Settings > Integrations.
To learn how, visit the project integrations page.