
Create Metrics Locally

Confident AI provides a comprehensive set of 40+ metrics powered by DeepEval for evaluating LLM applications. This page covers the available metrics and how to use them.

Code Summary

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval, AnswerRelevancyMetric

relevancy_metric = AnswerRelevancyMetric()
custom_metric = GEval(
    name="Custom Relevancy",
    criteria="How relevant is the `input` compared to `actual_output`?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

Quick Recap of Metrics

The metrics you choose should be based on:

  • System Architecture: Choose metrics based on your LLM system’s architecture (e.g., RAG, agentic workflows, conversational messages)
  • Use Case: Select metrics specific to your application’s purpose (e.g., text-to-SQL, summarization, RAG QA, copilots)

We recommend using 3-5 metrics per test run: 2-3 generic metrics targeting your system architecture and 1-2 specialized, custom metrics for your specific use case.

⚠️
Warning

Think twice, if not three times, before using more than 5 metrics per test run. While each metric may seem valuable, having too many metrics can dilute your focus and make it harder to draw meaningful insights. Using fewer, carefully chosen metrics often leads to clearer, more actionable results. For guidance on selecting the most impactful metrics for your use case, see our metrics selection guide.
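For example, a RAG QA application might pair two generic RAG metrics with a single specialized G-Eval metric. The sketch below is illustrative only; the thresholds and criteria string are placeholder assumptions, not recommendations:

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval

# 2 generic metrics targeting a RAG architecture
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

# 1 specialized, custom metric for the use case (criteria is a placeholder)
conciseness = GEval(
    name="Conciseness",
    criteria="Does the `actual_output` answer the `input` concisely, without filler?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

metrics = [relevancy, faithfulness, conciseness]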

Metrics via DeepEval

Using metrics in deepeval is as simple as importing them from the deepeval.metrics module:

from deepeval.metrics import (
    # Custom metrics
    GEval,
    DAGMetric,
    BaseMetric,
    # RAG metrics
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    # Agent metrics
    ToolCorrectnessMetric,
    TaskCompletionMetric,
    # Other metrics
    JsonCorrectnessMetric,
    RagasMetric,
    HallucinationMetric,
    ToxicityMetric,
    BiasMetric,
    SummarizationMetric,
)

# You can customize settings on metrics
faithfulness = FaithfulnessMetric(threshold=0.5)
contextual_recall = ContextualRecallMetric(strict_mode=True)
task_completion = TaskCompletionMetric(include_reason=False)

Once your metrics have finished running, testing reports will be automatically available on Confident AI. Also, if you’re unsure where the dataset is coming from, click here.
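To make the “running” part concrete, here is a minimal sketch that evaluates a single hardcoded test case with deepeval’s evaluate function. The test case values are made up for illustration; in practice you would pull your dataset from Confident AI instead:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, ContextualRecallMetric

# Hypothetical test case for illustration only
test_case = LLMTestCase(
    input="What is the return policy?",
    actual_output="You can return items within 30 days.",
    expected_output="Items can be returned within 30 days of purchase.",
    retrieval_context=["All purchases can be returned within 30 days."],
)

# Running evaluate() creates a test run, and the testing report
# shows up on Confident AI once the metrics finish
evaluate(
    test_cases=[test_case],
    metrics=[
        FaithfulnessMetric(threshold=0.5),
        ContextualRecallMetric(strict_mode=True),
    ],
)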

Use any custom LLM judge

The deepeval metrics are LLM-as-a-judge metrics. Although they default to OpenAI’s gpt-4o, you can customize the judge by providing a string that specifies the model you wish to use:

...

faithfulness = FaithfulnessMetric(model="o1")

You can also use other model providers that deepeval has integrations with; see DeepEval’s documentation for the full list of supported providers.

Lastly, you can wrap your own LLM API in deepeval’s DeepEvalBaseLLM class to use ANY model of your choice. Click here to learn how.
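For reference, here is a minimal sketch of what such a wrapper can look like. The call_my_llm_api helper is hypothetical and stands in for whatever client your model exposes; see the DeepEval documentation for the full contract:

from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import FaithfulnessMetric

def call_my_llm_api(prompt: str) -> str:
    # Hypothetical helper standing in for your own model's API client
    raise NotImplementedError

class MyCustomJudge(DeepEvalBaseLLM):
    def load_model(self):
        # Return the underlying model or client object (none needed here)
        return None

    def generate(self, prompt: str) -> str:
        return call_my_llm_api(prompt)

    async def a_generate(self, prompt: str) -> str:
        # Fall back to the synchronous call for simplicity
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "My Custom Judge"

# Any metric can then use the wrapped model as its judge
faithfulness = FaithfulnessMetric(model=MyCustomJudge())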

Example Implementation

Below are three quick examples of how the same answer relevancy metric can be implemented in three different ways.

⚠️

Each metric is unique in its own way, and you should DEFINITELY go to DeepEval’s documentation to learn how to use each of them.

Default

The default AnswerRelevancyMetric is simple and requires no additional configuration:

from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric()
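As a quick usage sketch (the test case values are made up for illustration), you can run the metric standalone on a single test case and inspect its score and reason:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(threshold=0.7)

# Hypothetical test case for illustration only
test_case = LLMTestCase(
    input="When does the store open?",
    actual_output="Our store opens at 9am on weekdays and 10am on weekends.",
)

metric.measure(test_case)
print(metric.score, metric.reason)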

G-Eval

A GEval implementation of the same answer relevancy metric, with more tailored criteria specifying that the actual_output has to be shorter than 3 sentences:

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

geval_relevancy = GEval(
    name="Custom Relevancy",
    criteria="""How relevant is the `input` compared to `actual_output`?
    The `actual_output` should also be less than 3 sentences long.""",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

What you’ll find, however, is that GEval is better suited for subjective custom criteria, so when incorporating the hard requirement of fewer than 3 sentences in the actual_output it will sometimes give a flaky score.

DAG

A DAGMetric implementation is perfect for combining objective and subjective criteria. It is an LLM powered decision tree that can combine other metrics such as AnswerRelevancyMetric or GEval:

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import DAGMetric, GEval
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)

geval_relevancy = GEval(
    name="Custom Relevancy",
    criteria="How relevant is the `input` compared to `actual_output`?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

less_than_3_sentences = BinaryJudgementNode(
    criteria="Does the `actual_output` have less than 3 sentences?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=geval_relevancy),
    ],
)

# Create the DAG
dag = DeepAcyclicGraph(root_nodes=[less_than_3_sentences])

# Create the metric
dag_relevancy = DAGMetric(name="Custom Relevancy", dag=dag)

Here, you can see the DAGMetric first makes a binary classification on whether there are indeed less than 3 sentences in the actual_output (objective criteria), before passing it to the GEval metric for a subjective evaluation of relevancy once this requirement has been met.
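As a quick sanity check of this flow, the resulting dag_relevancy metric runs like any other metric. The test case below is made up for illustration; its 2-sentence answer should pass the binary node and reach the G-Eval judgement:

from deepeval.test_case import LLMTestCase

# Hypothetical test case with a 2-sentence answer
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="Refunds are accepted within 30 days. A receipt is required.",
)

dag_relevancy.measure(test_case)
print(dag_relevancy.score, dag_relevancy.reason)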

Big Disclaimer

For detailed information about metric calculations and required test case parameters for each, please refer to the official DeepEval documentation. The metrics documentation on Confident AI focuses on helping you choose and use the right metrics for your use case, not the specific implementation.

Each metric in DeepEval has comprehensive documentation that covers:

  • Required test case parameters
  • Implementation details and examples
  • Calculation methodology

For instance, the AnswerRelevancyMetric documentation page provides a complete technical reference for that specific metric.
