Introduction to LLM Evaluation on Confident AI

Confident AI’s evaluation features are second to none, and in fact all the features you’ve seen up to this point in the documentation lead up to the LLM evaluation suite.

What does Confident AI’s LLM Evaluation offer?

We offer metrics that are:

  • Default, battle-tested, open-source, and plug-and-play
  • Custom, research-backed, and easy to create in natural language
  • For any use case, LLM system architecture, or framework

These metrics can be run both in CI/CD environments and as a standalone Python script.

You can also run evals at the individual component level, or end-to-end should you wish to treat your LLM app as a black box.

Feature Highlights

Quickstart

This walkthrough shows how to run LLM evaluations on Confident AI. Not all features are covered, but if you follow all the steps you’ll have a complete LLM evaluation pipeline set up.

Create Metric

Import metrics from DeepEval. In this example we’re using GEval to create a custom answer relevancy metric:

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

metric = GEval(
    name="Relevancy",
    criteria="Determine how relevant the 'actual output' is to the 'input'",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

How to customize your LLM judge?

Almost all metrics in DeepEval use LLM-as-a-judge, which means you’ll have to set an evaluation model. For most users this will be an OpenAI model, so you’ll need to set your OpenAI API key:

export OPENAI_API_KEY="sk-..."
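Before starting a long evaluation run, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch using only the standard library; the helper name `require_env` is hypothetical and not part of deepeval:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail fast with a readable message instead of midway through a run.
        raise RuntimeError(f"{name} is not set; export it before running evals.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of an evaluation script surfaces a missing key immediately rather than on the first judge call.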

You can also use other model providers that deepeval has integrations with.

Lastly, you can wrap your own LLM API in deepeval’s DeepEvalBaseLLM class to use ANY model of your choice. Click here to learn how.
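The wrapper pattern can be sketched without any dependencies. The stand-in base class `JudgeLLM` below is hypothetical and is not the real deepeval class; it assumes the interface consists of `load_model`, `generate`, `a_generate`, and `get_model_name`, so check the deepeval documentation for the exact interface in your version. `EchoJudge` and its `call_model` argument are likewise illustrative names:

```python
import asyncio
from abc import ABC, abstractmethod


class JudgeLLM(ABC):
    """Hypothetical stand-in for deepeval's DeepEvalBaseLLM (not the real class)."""

    @abstractmethod
    def load_model(self): ...

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    async def a_generate(self, prompt: str) -> str: ...

    @abstractmethod
    def get_model_name(self) -> str: ...


class EchoJudge(JudgeLLM):
    """Toy judge that wraps a plain callable instead of a real LLM API."""

    def __init__(self, call_model):
        self._call_model = call_model  # any function: prompt -> completion text

    def load_model(self):
        return self._call_model

    def generate(self, prompt: str) -> str:
        return self.load_model()(prompt)

    async def a_generate(self, prompt: str) -> str:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.generate, prompt)

    def get_model_name(self) -> str:
        return "echo-judge"


judge = EchoJudge(lambda prompt: f"echo: {prompt}")
print(judge.generate("Is this relevant?"))
```

In a real setup, the callable would be replaced by a call to your own LLM API, and the class would subclass `DeepEvalBaseLLM` instead of the stand-in.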

Metrics created this way run evaluations locally before sending results over to Confident AI. You can also run evals on the cloud by creating a metric collection.

Prepare for Evaluation

Decorate your LLM app (replace with your own):

from openai import OpenAI
from deepeval.tracing import observe

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@observe()
def your_llm_app(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

Pull the dataset you’ve created (full guide here):

from deepeval.dataset import EvaluationDataset

# Pull your dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

Run an LLM Eval

Putting everything together, run your first LLM evaluation:

from openai import OpenAI
from deepeval.tracing import observe
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@observe()
def your_llm_app(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

# Process each golden in your dataset
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input),
    )
    dataset.test_cases.append(test_case)

# Run an evaluation
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])
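To check the golden-to-test-case loop without an API key or a dataset on Confident AI, it can be dry-run offline. This is only a sketch of the loop's shape: `StubGolden`, `StubTestCase`, `build_test_cases`, and `fake_llm_app` are hypothetical stand-ins, not deepeval classes or functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StubGolden:
    """Hypothetical stand-in for a dataset golden: an input with no output yet."""
    input: str


@dataclass
class StubTestCase:
    """Hypothetical stand-in for LLMTestCase."""
    input: str
    actual_output: str


def build_test_cases(
    goldens: List[StubGolden], app: Callable[[str], str]
) -> List[StubTestCase]:
    """Call the app once per golden and pair each input with its output."""
    return [StubTestCase(input=g.input, actual_output=app(g.input)) for g in goldens]


def fake_llm_app(query: str) -> str:
    # Stub in place of a real model call so the loop runs offline.
    return f"Answer to: {query}"


goldens = [StubGolden("What is DeepEval?"), StubGolden("How do I pull a dataset?")]
test_cases = build_test_cases(goldens, fake_llm_app)
for case in test_cases:
    print(case.input, "->", case.actual_output)
```

Once the loop behaves as expected, swap the stubs for the real dataset, `LLMTestCase`, and your own LLM app.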

Congratulations 🎉! Your test run is now available on Confident AI automatically as a testing report.

Identify Failing Test Case(s)

Identify your failing test cases in the testing report on Confident AI.

This testing report is also publicly shareable.
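The testing report is the intended place to triage failures, but the pass/fail rule can also be sketched locally. The example below assumes a test case fails when any of its metric scores falls below that metric’s threshold; `MetricResult` and `failing_cases` are hypothetical names, and the scores are made-up illustrative values, not real results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MetricResult:
    """Hypothetical record of one metric's outcome for one test case."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a metric passes when its score meets the threshold.
        return self.score >= self.threshold


def failing_cases(results: Dict[str, List[MetricResult]]) -> Dict[str, List[str]]:
    """Map each failing test case input to the names of the metrics it failed."""
    failures = {}
    for case_input, metrics in results.items():
        failed = [m.name for m in metrics if not m.passed]
        if failed:
            failures[case_input] = failed
    return failures


# Illustrative values only.
results = {
    "What is DeepEval?": [MetricResult("Relevancy", 0.91, 0.5)],
    "How do I pull a dataset?": [MetricResult("Relevancy", 0.32, 0.5)],
}
print(failing_cases(results))  # {'How do I pull a dataset?': ['Relevancy']}
```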

Future Roadmap

  • Editor table columns for custom metrics
  • Better hyperparameters display
  • Insights page
  • Full-text search on test cases