Create Metrics On The Cloud
Running metrics on the cloud produces exactly the same results as running metrics locally. However, you might want to run metrics on the cloud when:
- You need to run online evaluations in production
- You want to enable non-technical team members to run evals directly from the UI
- You are not using Python; in this case, you can delegate evals to Confident AI’s hosted DeepEval servers instead
Otherwise, we recommend running metrics locally, since it gives you more control over customization.
Create a Custom Metric
This step is not strictly required, but if you wish to use custom metrics you can create one on the Metrics > Library page. A custom metric on Confident AI uses DeepEval’s G-Eval metric under the hood, so be sure to click on the link to learn what each parameter does as you’re creating one.
Custom Metrics on Platform
The name of your custom metric must not be a reserved one on Confident AI.
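Because a custom metric maps onto DeepEval’s G-Eval under the hood, a rough local equivalent can help clarify what the platform’s fields correspond to. The metric name, criteria, and evaluation parameters below are illustrative examples only, not values from your project:

```python
# Rough local equivalent of a custom metric created in Metrics > Library.
# The platform's form fields correspond to DeepEval's GEval parameters.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",  # must not clash with a reserved metric name on Confident AI
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```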
Create a Metric Collection
To create a metric collection, in your project space go to Metrics > Collections, click on the Create Collection button, and enter a collection name. Your collection name must not already be taken in your project.
Metric collections can be either single or multi-turn collections. If you’re looking to run online metrics on threads, for example, you should create a multi-turn collection instead.
Creating a collection of metrics on Confident AI allows you to specify which group of metrics you wish to evaluate your LLM application on, including any custom metrics you’ve created.
Creating metric collections on Confident AI
Add a metric
Click on Add metric in your newly created collection, and select the metric you wish to add to it.
The choice of metrics available to you will be different depending on whether your collection is single or multi-turn.
Configure metric settings
When you add a metric to a collection, you’ll have the option to configure each individual metric’s threshold, explainability, and strictness. There are three settings you can tune:
- Threshold: Determines the minimum evaluation score required for your metric to pass. If a metric fails, the test case also fails. Defaults to 0.5.
- Include reason: When turned on, a metric will generate a reason alongside the evaluation score for each metric run. Defaults to True.
- Strict mode: When turned on, a metric will only pass if the evaluation score is a perfect 1.0. Defaults to False.
You can change these settings and click the Save button; otherwise, they will keep their default values.
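For reference, these three settings mirror the threshold, include_reason, and strict_mode parameters on DeepEval metrics when running locally. A minimal sketch, using AnswerRelevancyMetric purely as an example metric:

```python
# The collection settings mirror DeepEval metric parameters when run locally.
# AnswerRelevancyMetric is used here purely as an example metric.
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    threshold=0.5,        # minimum score for the metric (and test case) to pass
    include_reason=True,  # generate a reason alongside the evaluation score
    strict_mode=False,    # if True, only a perfect score of 1.0 passes
)
```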
Using Your Metric Collection
In development
There are two ways to run evals using the metric collection you’ve defined:
- Through TypeScript or an HTTPS POST request that sends over a list of test cases with the generated outputs from your LLM app (a rough sketch of such a request follows below), or
- On the platform directly, triggered through the click of a button without the need for code
Click on the respective links to learn how.
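To illustrate the general shape of the first option, here is a minimal Python sketch of an HTTPS POST carrying test cases and the collection name. The endpoint path, header, and payload field names are placeholders for illustration only, not the real API; refer to the linked guide for the actual request schema.

```python
import os

import requests

# Illustrative only: the endpoint path and payload field names below are
# placeholders, not the real Confident AI API. See the linked guide for the
# actual request schema.
response = requests.post(
    "https://api.confident-ai.com/v1/evaluate",  # hypothetical endpoint
    headers={"CONFIDENT_API_KEY": os.environ["CONFIDENT_API_KEY"]},  # hypothetical auth header
    json={
        "metricCollection": "My Collection",  # name of the collection you created
        "testCases": [
            {
                "input": "What is your refund policy?",
                "actualOutput": "You can request a refund within 30 days of purchase.",  # your LLM app's output
            }
        ],
    },
    timeout=30,
)
response.raise_for_status()
```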
In production
The only way to run online evaluations in production is by providing the name of your created metric collection to the observe functions you’ve defined, which you can learn how to do here.
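As a rough sketch, and assuming DeepEval’s tracing integration, passing the collection name might look like the following. The metric_collection parameter and the update_current_span call are assumptions that can vary between SDK versions, so check the tracing reference for the exact names.

```python
# Hedged sketch: online evals in production via DeepEval tracing.
# The metric_collection parameter name is an assumption -- confirm it against
# the tracing reference for your SDK version.
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

@observe(metric_collection="My Collection")  # name of your metric collection
def chatbot(query: str) -> str:
    answer = "You can request a refund within 30 days of purchase."  # replace with your LLM call
    # Attach the test case so the collection's metrics can be run on it
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer
```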