LLM Evaluation Overview

Confident AI’s evaluation features are second-to-none, and in fact all features you’ve seen up to this point in the documentation leads up to the LLM evaluation suite.

Features Summary

Metrics Running Evals Unit-Testing in CI/CD Testing Reports A|B Regression Testing Insights

Simple Walkthrough

This walkthrough will show how to run LLM evaluations on Confident AI. Not all features will be covered, but if you follow all the steps you’ll have an ideal LLM evaluation pipeline setup.

Implement Metrics

Import metrics import DeepEval. In this example we’re using the AnswerRelevancyMetric:


from deepeval.metrics import AnswerRelevancyMetric
 
metric = AnswerRelevancyMetric()

99.99% of metrics on DeepEval uses LLM-as-a-judge, which means you’ll have to set your evaluation model. For most users, this will be using OpenAI, and you’ll need to set your OpenAI API key:


export OPENAI_API_KEY="sk-..."

You can also use other model providers that deepeval has integrations with, such as:

Lastly, you can wrap your own LLM API in deepeval’s DeepEvalBaseLLM class to use ANY model of your choice. Click here to learn how.

This will run evaluations locally before sending results over to Confident AI. You can also run evals on the cloud by creating a metric collection.

Prepare for Evaluation

Pull dataset

Pull the dataset you’ve created (full guide here):


from deepeval.dataset import EvaluationDataset
 
...
# Initialize and pull your dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

Pull prompt version

Pull the prompt version you’ve created (full guide here):


from deepeval.prompt import Prompt
 
...
# Use your actual prompt alias from Confident AI
prompt = Prompt(alias="your-prompt-alias")
prompt.pull()

Run an LLM Eval

Putting everything together, run your first LLM evaluation:


from deepeval.prompt import Prompt
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
 
 
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")
 
prompt = Prompt(alias="your-prompt-alias")
prompt.pull()
 
 
# Process each golden in your dataset
for goldens in dataset.goldens:
    input = golden.input
    # Replace your_llm_app() with your actual LLM application
    test_case = LLMTestCase(input=input, actual_output=your_llm_app(input, prompt))
    dataset.test_cases.append(test_case)
 
 
# Run an evaluation
evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()], # Replace with your metrics
    hyperparameters={"System Prompt": prompt} # Associate prompt version with results
)

Congratulations 🎉! Your test run is now available on Confident AI automatically as a testing report.

Identify Failing Test Case(s)

Identify your failing test cases in the testing report on Confident AI.

This testing report is also publicaly sharable.

Loading video...

Identify Failing LLM Test Cases

0 views • 0 days ago

Confident AI

100K subscribers

Future Roadmap

Editor table columns for custom metrics
Better hyperparameters display
Insights page