Run Your Second LLM Eval
The process for running another LLM evaluation is identical to running your first evaluation. When you have multiple evaluation results, Confident AI allows you to:
- Regression test different iterations: Compare performance between different versions to catch any (unexpected) degradation
- A|B experiment with hyperparameters: Systematically experiment with different prompts, models, and hyperparameters to optimize your application
Running multiple LLM evaluations enables you to compare and analyze different versions of your LLM application, and all the data processing required for this to happen is automatically handled by Confident AI.
Pull Your Prompt and Dataset Again
Simply repeat the steps in the previous instructions:
from deepeval.dataset import EvaluationDataset
from deepeval.prompt import Prompt
from deepeval.test_case import LLMTestCase

# Initialize and pull your dataset from Confident AI
dataset = EvaluationDataset()
# Replace "My Evals Dataset" with your actual dataset alias
dataset.pull(alias="My Evals Dataset")

# Replace "System Prompt" with your actual prompt alias from Confident AI
prompt = Prompt(alias="System Prompt")
prompt.pull()

# Process each golden in your dataset
for golden in dataset.goldens:
    input = golden.input
    # Create a test case with the input and your LLM's output
    test_case = LLMTestCase(
        input=input,
        # Replace your_llm_app() with your actual LLM application function
        actual_output=your_llm_app(input, prompt)
    )
    # Add the test case to your dataset for evaluation
    dataset.test_cases.append(test_case)
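The your_llm_app() function above is a placeholder for your actual LLM application. For reference, here's a minimal sketch of what it could look like, assuming an OpenAI-style chat client and a prompt with no template variables (the model name is a hypothetical choice, and we use deepeval's prompt.interpolate() helper to resolve the pulled prompt to plain text):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def your_llm_app(input: str, prompt: Prompt) -> str:
    # Resolve the pulled prompt version to plain text; with no template
    # variables, interpolate() returns the prompt text as-is
    system_prompt = prompt.interpolate()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model, swap in your own
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content

Passing the pulled prompt into your application like this ensures the version you evaluate is the same version you later log as a hyperparameter.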
You'll notice that the code is identical despite using a newer version of our "System Prompt". This is because, by default, prompt.pull() fetches the latest version, so no changes to the code are required.
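If you ever want to pin an evaluation to a specific prompt version instead of the latest, pull() also accepts a version argument. A quick sketch (the version string below is hypothetical; use one listed on your Confident AI dashboard):

prompt = Prompt(alias="System Prompt")
# Pin to an exact version instead of pulling the latest
prompt.pull(version="00.00.01")  # hypothetical version string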
Run Your Second LLM Eval
In order for different evaluations to be comparable to one another, it is important that you use the same metrics. So, we'll be using the AnswerRelevancyMetric here as well:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
...

# Define metric(s)
metric = AnswerRelevancyMetric()

# Run an evaluation
evaluate(
    test_cases=dataset.test_cases,
    metrics=[metric],
    hyperparameters={"System Prompt": prompt}
)
The "System Prompt"
hyperparameter is also automatically configured to the
latest prompt version.
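Hyperparameters aren't limited to prompts. If you're also A|B testing the underlying model or other settings, you can log them in the same dictionary so each test run records the exact configuration that produced it. A sketch (the "Model" and "Temperature" keys, and their values, are hypothetical labels of our choosing):

# Log any hyperparameters you want to compare across test runs
evaluate(
    test_cases=dataset.test_cases,
    metrics=[metric],
    hyperparameters={
        "System Prompt": prompt,  # the pulled Prompt object
        "Model": "gpt-4o-mini",   # hypothetical model name
        "Temperature": "0.7"      # hypothetical setting, logged as a string
    }
)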
Congratulations 🎉! Your second test run should now be available on Confident AI, but this time you should also see a comparison view to Compare Test Results. Select the previous test run, and start inspecting how this iteration of your LLM application differs from the previous one.
LLM A|B regression testing 101
What’s Next?
In the next section, we'll recap everything we've learnt and look at what comes after this simple evaluation workflow.