Run Your Second LLM Eval
The process for running another LLM evaluation is identical to running your first evaluation. When you have multiple evaluation results, Confident AI allows you to:
- Regression test different iterations: Compare performance between different versions to catch any (unexpected) degradation
- A|B experiment with hyperparameters: Systematically experiment with different prompts, models, and hyperparameters to optimize your application
Running multiple LLM evaluations enables you to compare and analyze different versions of your LLM application, and all the data processing required for this to happen is automatically handled by Confident AI.
Pull Your Prompt and Dataset Again
Simply repeat the steps in the previous instructions:
from deepeval.dataset import EvaluationDataset
from deepeval.prompt import Prompt
from deepeval.test_case import LLMTestCase

# Initialize and pull your dataset from Confident AI
dataset = EvaluationDataset()
# Replace "My Evals Dataset" with your actual dataset alias
dataset.pull(alias="My Evals Dataset")

# Replace "System Prompt" with your actual prompt alias from Confident AI
prompt = Prompt(alias="System Prompt")
prompt.pull()

# Process each golden in your dataset
for golden in dataset.goldens:
    input = golden.input
    # Create a test case with the input and your LLM's output
    test_case = LLMTestCase(
        input=input,
        # Replace your_llm_app() with your actual LLM application function
        actual_output=your_llm_app(input, prompt)
    )
    # Add the test case to your dataset for evaluation
    dataset.test_cases.append(test_case)
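The your_llm_app() function above is a placeholder for your actual LLM application. For reference, here's a minimal sketch of what it could look like, assuming an OpenAI-style chat client and a prompt with no template variables (the model name is a hypothetical choice, and we use deepeval's prompt.interpolate() helper to resolve the pulled prompt to plain text):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def your_llm_app(input: str, prompt: Prompt) -> str:
    # Resolve the pulled prompt version to plain text; with no template
    # variables, interpolate() returns the prompt text as-is
    system_prompt = prompt.interpolate()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model, swap in your own
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content

Passing the pulled prompt into your application like this ensures the version you evaluate is the same version you later log as a hyperparameter.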
You'll notice that the code is identical despite using a newer version of our "System Prompt". This is because, by default, prompt.pull() fetches the latest version, so no changes to the code are required.
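If you ever want to pin an evaluation to a specific prompt version instead of the latest, pull() also accepts a version argument. A quick sketch (the version string below is hypothetical; use one listed on your Confident AI dashboard):

prompt = Prompt(alias="System Prompt")
# Pin to an exact version instead of pulling the latest
prompt.pull(version="00.00.01")  # hypothetical version string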
Run Your Second LLM Eval
In order for different evaluations to be comparable to one another, it is important that you use the same metrics. So, we'll be using the AnswerRelevancyMetric here as well:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
...

# Define metric(s)
metric = AnswerRelevancyMetric()

# Run an evaluation
evaluate(
    test_cases=dataset.test_cases,
    metrics=[metric],
    hyperparameters={"System Prompt": prompt}
)
The "System Prompt"
hyperparameter is also automatically configured to the
latest prompt version.
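Hyperparameters aren't limited to prompts. If you're also A|B testing the underlying model or other settings, you can log them in the same dictionary so each test run records the exact configuration that produced it. A sketch (the "Model" and "Temperature" keys, and their values, are hypothetical labels of our choosing):

# Log any hyperparameters you want to compare across test runs
evaluate(
    test_cases=dataset.test_cases,
    metrics=[metric],
    hyperparameters={
        "System Prompt": prompt,  # the pulled Prompt object
        "Model": "gpt-4o-mini",   # hypothetical model name
        "Temperature": "0.7"      # hypothetical setting, logged as a string
    }
)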
Congratulations 🎉! Your second test run should now be available on Confident AI, but this time you should also see a comparison view to Compare Test Results. Select the previous test run, and start inspecting how this iteration of your LLM application differs from the previous one.
LLM A|B regression testing 101
What’s Next?
In the next section, we'll recap everything we've learnt and look at what comes after this simple evaluation workflow.