Create Metrics On The Cloud
Running metrics on the cloud produces exactly the same results as running metrics locally. However, you might want to run metrics on the cloud when:
- You need to run online evaluations in production
- You want to enable non-technical team members to run evals directly from the UI
- You are not using Python; in this case, you can delegate evals to Confident AI’s hosted DeepEval servers instead
Otherwise, we recommend running metrics locally, since it gives you more control over customization.
Create a Custom Metric
This step is not strictly required, but if you wish to use custom metrics you can create one on the Metrics > Library page. A custom metric on Confident AI uses DeepEval’s G-Eval metric under the hood, so be sure to click on the link to learn what each parameter does as you’re creating one.
Custom Metrics on Platform
The name of your custom metric must not be a reserved one on Confident AI.
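Because a custom metric maps onto DeepEval’s G-Eval under the hood, a rough local equivalent can help clarify what the platform’s fields correspond to. The metric name, criteria, and evaluation parameters below are illustrative examples only, not values from your project:

```python
# Rough local equivalent of a custom metric created in Metrics > Library.
# The platform's form fields correspond to DeepEval's GEval parameters.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",  # must not clash with a reserved metric name on Confident AI
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```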
Create a Metric Collection
To create a metric collection, in your project space go to Metrics > Collections, click on the Create Collection button, and enter a collection name. Your collection name must not already be taken in your project.
Metric collections can be either single or multi-turn collections. If you’re looking to run online metrics on threads, for example, you should create a multi-turn collection instead.
Creating a collection of metrics on Confident AI allows you to specify which group of metrics you wish to evaluate your LLM application on, including any custom metrics you’ve created.
Creating metric collections on Confident AI
Add a metric
Click on Add metric in your newly created collection, and select the metric you wish to add to it.
The choice of metrics available to you will be different depending on whether your collection is single or multi-turn.
Configure metric settings
When you add a metric to a collection, you’ll have the option to configure each individual metric’s threshold, explainability, and strictness. There are three settings you can tune:
- Threshold: Determines the minimum evaluation score required for your metric to pass. If a metric fails, the test case also fails. Defaults to 0.5.
- Include reason: When turned on, a metric will generate a reason alongside the evaluation score for each metric run. Defaults to True.
- Strict mode: When turned on, a metric will only pass if the evaluation score is a perfect 1.0. Defaults to False.
You can change these settings and click the Save button; otherwise, they will keep their default values.
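For reference, these three settings mirror the threshold, include_reason, and strict_mode parameters on DeepEval metrics when running locally. A minimal sketch, using AnswerRelevancyMetric purely as an example metric:

```python
# The collection settings mirror DeepEval metric parameters when run locally.
# AnswerRelevancyMetric is used here purely as an example metric.
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    threshold=0.5,        # minimum score for the metric (and test case) to pass
    include_reason=True,  # generate a reason alongside the evaluation score
    strict_mode=False,    # if True, only a perfect score of 1.0 passes
)
```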
Using Your Metric Collection
In development
There are two ways to run evals using the metric collection you’ve defined:
- Through TypeScript or an HTTPS POST request that sends over a list of test cases with the generated outputs from your LLM app (a rough sketch of such a request follows below), or
- On the platform directly, triggered through the click of a button without the need for code
Click on the respective links to learn how.
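To illustrate the general shape of the first option, here is a minimal Python sketch of an HTTPS POST carrying test cases and the collection name. The endpoint path, header, and payload field names are placeholders for illustration only, not the real API; refer to the linked guide for the actual request schema.

```python
import os

import requests

# Illustrative only: the endpoint path and payload field names below are
# placeholders, not the real Confident AI API. See the linked guide for the
# actual request schema.
response = requests.post(
    "https://api.confident-ai.com/v1/evaluate",  # hypothetical endpoint
    headers={"CONFIDENT_API_KEY": os.environ["CONFIDENT_API_KEY"]},  # hypothetical auth header
    json={
        "metricCollection": "My Collection",  # name of the collection you created
        "testCases": [
            {
                "input": "What is your refund policy?",
                "actualOutput": "You can request a refund within 30 days of purchase.",  # your LLM app's output
            }
        ],
    },
    timeout=30,
)
response.raise_for_status()
```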
In production
The only way to run online evaluations in production is by providing the name of your created metric collection to the observe functions you’ve defined, which you can learn how to do here.
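As a rough sketch, and assuming DeepEval’s tracing integration, passing the collection name might look like the following. The metric_collection parameter and the update_current_span call are assumptions that can vary between SDK versions, so check the tracing reference for the exact names.

```python
# Hedged sketch: online evals in production via DeepEval tracing.
# The metric_collection parameter name is an assumption -- confirm it against
# the tracing reference for your SDK version.
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

@observe(metric_collection="My Collection")  # name of your metric collection
def chatbot(query: str) -> str:
    answer = "You can request a refund within 30 days of purchase."  # replace with your LLM call
    # Attach the test case so the collection's metrics can be run on it
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer
```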