Testing Reports
Test runs are displayed as testing reports on Confident AI and represent a snapshot of your LLM application’s performance. This is what your organization will use to assess whether the latest iteration of your LLM application is up to standard.
It is not uncommon for testing reports to drive deployment decisions within an engineering team.
You can create a test run by running an evaluation, including evaluations run in CI/CD pipelines.
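As a minimal sketch, the example below creates a test run by running an evaluation with deepeval’s `evaluate()` function; the metric, threshold, and test case contents are illustrative assumptions, and you would need to be logged in to Confident AI (e.g. via `deepeval login`) for the results to appear as a testing report.

```python
# Minimal sketch: running an evaluation creates a test run, which shows up
# as a testing report on Confident AI when you are logged in.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Illustrative test case; in practice this comes from your LLM application
test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="You can request a refund within 30 days of purchase.",
    retrieval_context=["Our refund policy allows returns within 30 days."],
)

# This call creates the test run from the supplied test cases and metrics
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```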
Video Summary
[Video: Navigating Testing Reports on Confident AI]
Metric Score Analysis
A testing report’s Overview page provides a comprehensive analysis of your evaluation metrics, including:
- Count of passing and failing metrics
- Distribution analysis of metric scores
- Average and standard deviation calculations
- Median and quartile calculations
- Metric trends over time
Test Case Explorer
Examine, filter, and search for individual test cases with comprehensive information in the Test Cases page (see the sketch after this list):
- View `actual_outputs`, `retrieval_contexts`, `tools_called`, `completion_time`, `token_cost`, etc.
- Review metric results, including `scores`, `reasoning`, `thresholds`, and `error` (if any)
- Debug by assessing verbose logs for metrics
- Filter & search for test cases by name, metric status, metric scores, and passing status
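As a rough illustration, the sketch below shows how these fields might be populated on a test case so they appear in the Test Cases page; the tool name, timing, and cost values are assumptions for demonstration only.

```python
# Sketch of a test case carrying the fields surfaced in the Test Cases page;
# all values below are illustrative assumptions.
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Summarize my last invoice.",
    actual_output="Your last invoice totals $42.10, issued on March 3rd.",
    retrieval_context=["Invoice #1042: $42.10, issued 2024-03-03."],
    tools_called=[ToolCall(name="get_invoice")],  # tools your app invoked
    completion_time=1.2,                          # seconds to generate the output
    token_cost=0.0004,                            # cost of the LLM call
)
```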
Dataset Creation
Create new datasets from test cases in your test run by going to the Test Cases page and clicking Save as new dataset. The `actual_output`s will not be included in the goldens of your newly created dataset, since storing them there goes against best practices on Confident AI.
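As a minimal sketch, assuming the new dataset was saved under the hypothetical alias "My Test Run Dataset", you could pull it back for future evaluations like this:

```python
# Sketch: pulling a dataset created from a test run; the alias is hypothetical.
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Test Run Dataset")

# Goldens contain inputs (and optionally expected outputs) but no actual_outputs,
# so outputs should be regenerated by your LLM application at evaluation time.
for golden in dataset.goldens:
    print(golden.input)
```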
Report Sharing
Share and collaborate on testing reports with your team, both on the Overview and Test Cases page:
- Internally shareable links for test runs
- Publicly shareable links for test runs
- Export reports in CSV or JSON format
Hyperparameter Tracking
Monitor and analyze the prompt, model, and other configurations associated with your testing results (see the sketch after this list):
- Record model versions
- Log the prompt versions used, either as text or as messages
- Track additional parameters as freeform key-value pairs (e.g. for embedding models)
- Filter & search for test runs associated with certain hyperparameters
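A sketch of how hyperparameters might be logged alongside an evaluation is shown below; the key names and values are illustrative assumptions, and it assumes `evaluate()` accepts a `hyperparameters` argument.

```python
# Sketch: logging hyperparameters with a test run so the resulting testing
# report can be filtered and searched by them. Keys and values are illustrative.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

evaluate(
    test_cases=[LLMTestCase(input="Hi", actual_output="Hello! How can I help?")],
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={
        "model": "gpt-4o",
        "prompt version": "v2-system-prompt",
        "embedding model": "text-embedding-3-small",
    },
)
```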
Test Run Comparison
Compare different iterations of your LLM app in the Compare Test Results page:
- Side-by-side comparison of test runs, including metric score distributions and the percentage of passing test cases
- Highlight comparable test cases that have regressed or improved in performance
- Filter for regression or improvement only test cases
- Share comparison insights via links
This feature in a testing report is used for A/B regression testing of different iterations of your LLM app. More information on this feature can be found in the next section.