Testing Reports

Test runs are displayed as testing reports on Confident AI and represent a snapshot of your LLM application's performance. This is what your organization will use to assess whether the latest iteration of your LLM application is up to standard.

💡 It is not uncommon for testing reports to drive deployment decisions within an engineering team.

You can create a test run by running an evaluation, including evaluations run in CI/CD pipelines.
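For example, a test run can be created with deepeval's evaluate() function. The sketch below is illustrative: the metric, threshold, and test case values are placeholders, and you must be logged in to Confident AI (e.g. via deepeval login) for the results to show up as a testing report.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical test case; in practice, actual_output comes from
# running your own LLM application on the input
test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
)

# Calling evaluate() creates a test run, which shows up as a
# testing report on Confident AI
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.7)],
)
```

In CI/CD, the same is typically done by running a pytest-style test file with the deepeval test run command as a pipeline step.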

Video Summary

Video: Navigating Testing Reports on Confident AI

Metric Score Analysis

A testing report's Overview page provides a comprehensive analysis of your evaluation metrics, including the following (a short sketch of these statistics follows the list):

  • Count of passing and failing metrics
  • Distribution analysis of metric scores
  • Average and standard deviation calculations
  • Median and quartile calculations
  • Metric trends over time
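For intuition, the statistics above are standard descriptive statistics computed over each metric's scores across test cases. Below is a quick sketch using Python's statistics module on hypothetical scores, not Confident AI's actual implementation.

```python
import statistics

# Hypothetical metric scores from a test run (one score per test case)
scores = [0.62, 0.71, 0.88, 0.93, 0.95, 0.97, 1.0]
threshold = 0.7  # the metric passes when score >= threshold

passing = sum(score >= threshold for score in scores)
print(f"passing: {passing}, failing: {len(scores) - passing}")
print(f"average: {statistics.mean(scores):.2f}")
print(f"std dev: {statistics.stdev(scores):.2f}")
print(f"median:  {statistics.median(scores):.2f}")

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(scores, n=4)
print(f"quartiles: {q1:.2f}, {q2:.2f}, {q3:.2f}")
```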

Test Case Explorer

Examine, filter, and search for individual test cases with comprehensive information on the Test Cases page:

  • View actual_outputs, retrieval_contexts, tools_called, completion_time, token_cost, etc. (see the sketch after this list)
  • Review metric results, including scores, reasoning, thresholds, and errors, if any
  • Debug by inspecting verbose logs for metrics
  • Filter & search for test cases by name, metric status, metric scores, and passing status
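These fields are populated from the test cases you evaluate. Below is a minimal sketch of setting them on a deepeval LLMTestCase; the values are hypothetical, and the exact parameters available (e.g. tools_called, completion_time, token_cost) may vary with your deepeval version.

```python
from deepeval.test_case import LLMTestCase, ToolCall

# Hypothetical values; in practice these come from running your LLM app
test_case = LLMTestCase(
    input="Where is my order #1234?",
    actual_output="Your order #1234 shipped yesterday and arrives Friday.",
    retrieval_context=["Order #1234: shipped 2024-05-01, ETA 2024-05-03"],
    tools_called=[ToolCall(name="lookup_order")],
    completion_time=1.24,  # seconds taken to generate actual_output
    token_cost=0.0031,     # cost of generating actual_output
)
```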

Dataset Creation

Create new datasets from the test cases in your test run by going to the Test Cases page and clicking Save as new dataset. The actual_outputs will not be included in the goldens of your newly created dataset, since storing them in goldens is against best practices on Confident AI.
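Because goldens exclude actual_outputs, you regenerate them when reusing the dataset for a new evaluation. Below is a rough sketch, assuming deepeval's EvaluationDataset.pull(); the dataset alias "My New Dataset" and my_llm_app are hypothetical placeholders.

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

def my_llm_app(prompt: str) -> str:
    # Hypothetical stand-in for your actual LLM application
    return "..."

# Pull the dataset that was saved from the test run's test cases
dataset = EvaluationDataset()
dataset.pull(alias="My New Dataset")

# Goldens carry inputs (and optionally expected outputs) but no
# actual_outputs -- generate those from the latest version of your app
test_cases = [
    LLMTestCase(
        input=golden.input,
        actual_output=my_llm_app(golden.input),
        expected_output=golden.expected_output,
    )
    for golden in dataset.goldens
]
```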

Report Sharing

Share and collaborate on testing reports with your team, on both the Overview and Test Cases pages:

  • Internally shareable links for test runs
  • Publicly shareable links for test runs
  • Export reports in CSV or JSON format

Hyperparameter Tracking

Monitor and analyze the configuration (prompts, models, etc.) associated with your testing results (see the sketch after this list):

  • Record model versions
  • Log the prompt versions used, either text or messages
  • Track additional parameters as freeform key-value pairs (e.g. for embedding models)
  • Filter & search for test runs associated with certain hyperparameters
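Below is a minimal sketch of logging hyperparameters with a test run, assuming the hyperparameters argument to deepeval's evaluate(); the keys and values are illustrative freeform key-value pairs.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
)

# Hyperparameters are recorded with the test run, so you can later
# filter and compare test runs by model, prompt version, etc.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.7)],
    hyperparameters={
        "model": "gpt-4o",                            # model version
        "prompt version": "v2",                       # prompt used
        "embedding model": "text-embedding-3-small",  # extra key-value pair
    },
)
```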

Test Run Comparison

Compare different iterations of your LLM app in the Compare Test Results page:

  • Side-by-side comparison of test runs, including metric score distributions and % of passing test cases
  • Highlight comparable test cases that have regressed or improved in performance
  • Filter for regression-only or improvement-only test cases
  • Share comparison insights via links

This feature of a testing report is used for A/B regression testing of different iterations of your LLM app. More information on this feature can be found in the next section.
