
Run Your Second LLM Eval

The process for running another LLM evaluation is identical to running your first evaluation. When you have multiple evaluation results, Confident AI allows you to:

  1. Regression test different iterations: Compare performance between different versions to catch any (unexpected) degradation
  2. A|B experiment with hyperparameters: Systematically experiment with different prompts, models, and hyperparameters to optimize your application

Running multiple LLM evaluations enables you to compare and analyze different versions of your LLM application, and all the data processing required for this to happen is automatically handled by Confident AI.
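As a preview of the A|B pattern, here is a minimal sketch of running the same evaluation code twice while logging different hyperparameters, so the two resulting test runs can be compared side by side on Confident AI. The hyperparameter keys and model names are illustrative, and `dataset`, `metric`, and `prompt` are the objects you'll create in the steps below.

from deepeval import evaluate

# Run 1: log the hyperparameters used for this iteration
# (keys and values here are illustrative; log whatever you actually vary)
evaluate(
    test_cases=dataset.test_cases,
    metrics=[metric],
    hyperparameters={"System Prompt": prompt, "Model": "gpt-4o-mini"},
)

# Run 2: same metrics, different hyperparameters to A|B compare against run 1
evaluate(
    test_cases=dataset.test_cases,
    metrics=[metric],
    hyperparameters={"System Prompt": prompt, "Model": "gpt-4o"},
)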

Pull Your Prompt and Dataset Again

Simply repeat the steps in the previous instructions:

from deepeval.dataset import EvaluationDataset
from deepeval.prompt import Prompt
from deepeval.test_case import LLMTestCase

# Initialize and pull your dataset from Confident AI
dataset = EvaluationDataset()
# Replace "My Evals Dataset" with your actual dataset alias
dataset.pull(alias="My Evals Dataset")

# Replace "System Prompt" with your actual prompt alias from Confident AI
prompt = Prompt(alias="System Prompt")
prompt.pull()

# Process each golden in your dataset
for golden in dataset.goldens:
    input = golden.input

    # Create a test case with the input and your LLM's output
    test_case = LLMTestCase(
        input=input,
        # Replace your_llm_app() with your actual LLM application function
        actual_output=your_llm_app(input, prompt)
    )
    # Add the test case to your dataset for evaluation
    dataset.test_cases.append(test_case)
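If you don't have an LLM application handy, here is a minimal, hypothetical sketch of what your_llm_app() could look like. It assumes an OpenAI client and that prompt.interpolate() returns the pulled prompt text with any template variables filled in; adapt it to however your application actually consumes the prompt.

from openai import OpenAI
from deepeval.prompt import Prompt

client = OpenAI()

def your_llm_app(input: str, prompt: Prompt) -> str:
    # Render the pulled prompt as plain text (assumed usage; substitute
    # however your app builds its system prompt)
    system_prompt = prompt.interpolate()

    # Call the model of your choice with the pulled system prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content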
💡

You’ll notice that the code is identical even though we’re using a newer version of our "System Prompt". This is because, by default, prompt.pull() pulls the latest version, so no changes to the code are required.
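If you'd rather evaluate against a specific prompt version instead of the latest one, you can pin it when pulling. This is a hedged sketch assuming prompt.pull() accepts a version argument; the version string below is illustrative, so use the one shown for your prompt on Confident AI.

# Pin a specific prompt version instead of the latest (version string is illustrative)
prompt = Prompt(alias="System Prompt")
prompt.pull(version="00.00.01")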

Run Your Second LLM Eval

In order for different evaluations to be comparable to one another, it is important that you use the same metrics. So, we’ll be using the AnswerRelevancyMetric here as well:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

...

# Define metric(s)
metric = AnswerRelevancyMetric()

# Run an evaluation
evaluate(
    test_cases=dataset.test_cases,
    metrics=[metric],
    hyperparameters={"System Prompt": prompt}
)

The "System Prompt" hyperparameter is also automatically configured to the latest prompt version.

Congratulations 🎉! Your second test run should now be available on Confident AI, but this time you should see a comparison view where you can Compare Test Results. Select the previous test run and start inspecting how this iteration of your LLM application differs from the previous one.

[Video: LLM A|B regression testing 101]

What’s Next?

In the next section, we’ll recap everything we’ve learnt and look at what comes after this simple evaluation workflow.
