Metrics
Confident AI provides a comprehensive set of 20+ metrics, powered by DeepEval, for evaluating LLM applications. This page covers the available metrics and how to use them.
Code & Video Summary
To define metrics locally using DeepEval (recommended):
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval, AnswerRelevancyMetric

relevancy_metric = AnswerRelevancyMetric()
custom_metric = GEval(
    name="Custom Relevancy",
    criteria="How relevant is the `input` compared to `actual_output`?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)
To define metrics on the cloud:
Creating metric collections on Confident AI
Quick Recap of Metrics
The metrics you choose should be based on:
- System Architecture: Choose metrics based on your LLM system’s architecture (e.g., RAG, agentic workflows, conversational messages)
- Use Case: Select metrics specific to your application’s purpose (e.g., text-to-SQL, summarization, RAG QA, copilots)
We recommend using 3-5 metrics per test run: 2-3 generic metrics targeting your system architecture and 1-2 specialized, custom metrics for your specific use case.
Think twice, if not three times, before using more than 5 metrics per test run. While each metric may seem valuable, having too many metrics can dilute your focus and make it harder to draw meaningful insights. Using fewer, carefully chosen metrics often leads to clearer, more actionable results. For guidance on selecting the most impactful metrics for your use case, see our metrics selection guide.
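For example (a minimal sketch, assuming a RAG-based QA use case; the custom metric name and criteria below are purely illustrative), a test run might combine two generic RAG metrics with one custom GEval metric:

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval

# 2 generic metrics targeting the RAG architecture
generic_metrics = [AnswerRelevancyMetric(), FaithfulnessMetric()]

# 1 custom metric targeting the specific use case (hypothetical criteria)
citation_metric = GEval(
    name="Cites Context",  # illustrative name, not a built-in metric
    criteria="Does the `actual_output` only make claims supported by the `retrieval_context`?",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)

metrics = generic_metrics + [citation_metric]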
Generic metrics
These metrics are agnostic to your use case but specific to different LLM systems. If you're using agentic RAG, for example, you'll want to use a combination of RAG and agentic metrics.
- RAG metrics evaluate how well your system retrieves and uses context to generate answers. Key metrics include answer relevancy, faithfulness, and contextual metrics.
- Agentic metrics assess how well your LLM agent performs tasks, makes decisions, and follows workflows. These metrics help evaluate the agent's ability to break down complex tasks and execute them effectively.
- Conversational metrics measure the quality of multi-turn conversations, including coherence, context retention, and response appropriateness across multiple exchanges.
You can read more about how each individual metric is calculated and the `LLMTestCase` parameters required for each test case by clicking on the respective links in the metrics selection guide.
These metrics are available both via DeepEval and on the cloud.
Custom metrics
Custom metrics are system-agnostic evaluations that measure specific aspects of your LLM application’s performance, regardless of its underlying architecture. Unlike generic metrics that are tied to specific system types (like RAG or agentic workflows), custom metrics focus on universal qualities that matter across all LLM applications.
There are three types of custom metrics:
- GEval: Allows you to define custom evaluation criteria using natural language. GEval is particularly useful for creating domain-specific evaluations that are subjective in nature.
- DAG (Deep Acyclic Graph): Allows you to define deterministic, LLM-as-a-judge decision trees for accurate and reliable evals. DAG is perfect for domain-specific evaluations that have clear, objective success criteria. You can also use `GEval` as one of the nodes in the `DAGMetric`.
- Custom Code: For highly specific evaluation needs, you can implement your own metric logic in Python. Common scenarios include statistical calculations, where LLM-as-a-judge is not required (a minimal sketch follows this list).
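For the custom code option, a minimal sketch might look like the following; the `measure`, `a_measure`, and `is_successful` hooks follow DeepEval's `BaseMetric` pattern, while the length-based scoring itself is purely illustrative:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class OutputLengthMetric(BaseMetric):
    # Hypothetical statistical metric: rewards concise actual_outputs
    def __init__(self, threshold: float = 0.5, max_words: int = 100):
        self.threshold = threshold
        self.max_words = max_words

    def measure(self, test_case: LLMTestCase) -> float:
        word_count = len(test_case.actual_output.split())
        # Full marks within the word budget, decaying as the output grows beyond it
        self.score = min(1.0, self.max_words / max(word_count, 1))
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Output Length"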
Custom metrics are valuable because they:
- Remain relevant even if you change your system architecture
- Can be tailored to your specific use case requirements
- Provide consistent evaluation across different implementations of the same functionality
All three custom metrics are available via DeepEval, but only `GEval` is available on the cloud.
The supported LLM use cases page is a great resource to see how each custom metric can be used for specific use cases.
Other metrics
Other metrics include ones like bias, toxicity, and summarization (although summarization is better handled by `GEval` and `DAGMetric`), and you can read more about the list here.
These metrics are mostly only available via DeepEval.
Threshold, Include Reason, and Strict Mode
You'll have the option to configure each individual metric's threshold, explainability, and strictness, either in `deepeval` or within a metric collection on the cloud. There are three settings you can tune:
- Threshold: Determines the minimum evaluation score required for your metric to pass. If a metric fails, the test case also fails. Defaults to `0.5`.
- Include reason: When turned on, a metric will generate a reason alongside the evaluation score for each metric run. Defaults to `True`.
- Strict mode: When turned on, a metric will only pass if the evaluation score is a perfect `1.0`. Defaults to `False`.
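For example, in `deepeval` all three settings can be passed when constructing a metric (the values below are arbitrary):

from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    threshold=0.7,        # minimum score required to pass (defaults to 0.5)
    include_reason=True,  # generate a reason alongside the score (default)
    strict_mode=False,    # when True, only a perfect 1.0 passes
)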
Metrics via DeepEval
Using metrics in `deepeval` is as simple as importing them from the `.metrics` module:
from deepeval.metrics import (
    # Custom metrics
    GEval,
    DAGMetric,
    BaseMetric,
    # RAG metrics
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    # Agent metrics
    ToolCorrectnessMetric,
    TaskCompletionMetric,
    # Other metrics
    JsonCorrectnessMetric,
    RagasMetric,
    HallucinationMetric,
    ToxicityMetric,
    BiasMetric,
    SummarizationMetric,
)
# You can customize settings on metrics
faithfulness = FaithfulnessMetric(threshold=0.5)
contextual_recall = ContextualRecallMetric(strict_mode=True)
task_completion = TaskCompletionMetric(include_reason=False)
Once your metrics have finished running, testing reports will be automatically available on Confident AI. Also, if you're unsure where the dataset is coming from, click here.
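As a minimal sketch (the test case contents here are made up, and in practice would come from your dataset), running metrics and producing a test report looks like this:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, ContextualRecallMetric

# A single hand-written test case, purely for illustration
test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="You can request a refund within 30 days of purchase.",
    expected_output="Refunds are accepted within 30 days.",
    retrieval_context=["Our policy allows refunds within 30 days of purchase."],
)

evaluate(
    test_cases=[test_case],
    metrics=[FaithfulnessMetric(threshold=0.5), ContextualRecallMetric(strict_mode=True)],
)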
Use any custom LLM judge
The `deepeval` metrics are LLM-as-a-judge metrics and, although they default to OpenAI's `gpt-4o`, you can customize them by providing a string to specify the model you wish to use:
...
faithfulness = FaithfulnessMetric(model="o1")
You can also use other model providers that `deepeval` has integrations with, such as:
Lastly, you can wrap your own LLM API in `deepeval`'s `DeepEvalBaseLLM` class to use ANY model of your choice. Click here to learn how.
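A rough sketch of such a wrapper is shown below, assuming the interface described in DeepEval's docs (load_model, generate, a_generate, get_model_name); the client object and its complete method are placeholders for whatever API you actually call:

from deepeval.models import DeepEvalBaseLLM

class MyCustomLLM(DeepEvalBaseLLM):
    def __init__(self, client):
        self.client = client  # any client exposing a completion call (placeholder)

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        # Replace with your provider's actual completion call
        return self.client.complete(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "my-custom-llm"

# Then pass it to any metric, e.g. FaithfulnessMetric(model=MyCustomLLM(client))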
Example Implementation
We’ll be showing 3 quick examples of how the same answer relevancy metric can be implemented in 3 different ways.
Each metric is unique in its own way, and you should DEFINITELY go to DeepEval's documentation to learn how to use each of them.
Default
The default `AnswerRelevancyMetric` is simple and requires no additional configuration:
from deepeval.metrics import AnswerRelevancyMetric
metric = AnswerRelevancyMetric()
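If you want to run a single metric on its own (outside of a full test run), you can call measure directly; the test case below is made up for illustration:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric()
test_case = LLMTestCase(
    input="What are your opening hours?",
    actual_output="We are open 9am to 5pm, Monday to Friday.",
)
metric.measure(test_case)
print(metric.score, metric.reason)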
G-Eval
A `GEval` implementation of the same answer relevancy metric, with more tailored criteria specifying that the `actual_output` has to be shorter than 3 sentences:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval
geval_relevancy = GEval(
    name="Custom Relevancy",
    criteria="""How relevant is the `input` compared to `actual_output`?
The `actual_output` should also be less than 3 sentences long.""",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)
What you'll find, however, is that `GEval` is better for subjective custom criteria, so when incorporating the hard requirement of < 3 sentences in the `actual_output`, it will sometimes give a flaky score.
DAG
A `DAGMetric` implementation is perfect for combining objective and subjective criteria. It is an LLM-powered decision tree that can combine other metrics such as `AnswerRelevancyMetric` or `GEval`:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric, GEval

geval_relevancy = GEval(
    name="Custom Relevancy",
    criteria="How relevant is the `input` compared to `actual_output`?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

less_than_3_sentences = BinaryJudgementNode(
    criteria="Does the `actual_output` have less than 3 sentences?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=geval_relevancy),
    ],
)
# Create the DAG
dag = DeepAcyclicGraph(root_nodes=[less_than_3_sentences])
# Create the metric
dag_relevancy = DAGMetric(name="Custom Relevancy", dag=dag)
Here, you can see that the `DAGMetric` first makes a binary classification on whether there are indeed less than 3 sentences in the `actual_output` (an objective criterion), before passing it to the `GEval` metric for a subjective evaluation of relevancy once this requirement has been met.
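As with any other metric, the assembled `DAGMetric` can then be run on a test case; this sketch continues from the snippet above, and the test case contents are made up:

from deepeval.test_case import LLMTestCase

# dag_relevancy is the DAGMetric defined above
test_case = LLMTestCase(
    input="Summarize our refund policy.",
    actual_output="Refunds are accepted within 30 days. A receipt is required.",
)
dag_relevancy.measure(test_case)
print(dag_relevancy.score, dag_relevancy.reason)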
Metrics on the Cloud
Metrics on the cloud are also powered by DeepEval (but run on Confident AI's servers instead) and give you exactly the same results, so we highly recommend that you only continue with this section once you are happy with your selection of metrics and their performance.
That being said, if you are using another programming language such as TypeScript, or wish to trigger evaluations directly on the platform at the click of a button, you'll need to define and run metrics on Confident AI instead.
There are far more advantages to running evaluations locally via `deepeval`, mainly the ease of metric customization. Feel free to skip this section if you're already able to run evaluations locally.
There are two ways to run evals with metrics on the cloud:
- Through an HTTPS `POST` request that sends over a list of test cases with the generated outputs from your LLM app, or
- On the platform directly, triggered through the click of a button without the need for code
Ultimately, regardless of your chosen approach, you must first define a collection of metrics to specify which metrics to run on Confident AI.
Create a metric collection
Creating a collection of metrics on Confident AI allows you to specify which group of metrics you wish to evaluate your LLM application on.
To create a metric collection, in your project space go to Metrics > Collections, click on the Create Collection button, and enter a collection name. Your collection name must not already be taken in your project.
Configure metric settings
When you add a metric to a collection, the threshold, reasoning, and strictness settings are automatically set to their default values; you can change them and click Save.
Big Disclaimer
For detailed information about metric calculations and the required test case parameters for each, please refer to the official DeepEval documentation. The metrics documentation on Confident AI focuses on helping you choose and use the right metrics for your use case, not the specific implementation.
Each metric in DeepEval has comprehensive documentation that covers:
- Required test case parameters
- Implementation details and examples
- Calculation methodology
For instance, the `AnswerRelevancyMetric` documentation page provides a complete technical reference for that specific metric.