Testing Reports
Test runs are displayed as testing reports on Confident AI and represent a snapshot of your LLM application’s performance. This is what your organization will use to assess whether the latest iteration of your LLM application is up to standard.
It is not uncommon for testing reports to drive deployment decisions within an engineering team.
You can create a test run by running an evaluation, including evaluations run in CI/CD pipelines.
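As a minimal sketch, the example below creates a test run by running an evaluation with deepeval’s `evaluate()` function; the metric, threshold, and test case contents are illustrative assumptions, and you would need to be logged in to Confident AI (e.g. via `deepeval login`) for the results to appear as a testing report.

```python
# Minimal sketch: running an evaluation creates a test run, which shows up
# as a testing report on Confident AI when you are logged in.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Illustrative test case; in practice this comes from your LLM application
test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="You can request a refund within 30 days of purchase.",
    retrieval_context=["Our refund policy allows returns within 30 days."],
)

# This call creates the test run from the supplied test cases and metrics
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```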
Video Summary
[Video: Navigating Testing Reports on Confident AI]
Metric Score Analysis
A testing report’s Overview page provides a comprehensive analysis of your evaluation metrics, including:
- Count of passing and failing metrics
- Distribution analysis of metric scores
- Average and standard deviation calculations
- Median and quartile calculations
- Metric trends over time
Test Case Explorer
Examine, filter, and search for individual test cases with comprehensive information in the Test Cases page (see the sketch after this list):
- View `actual_outputs`, `retrieval_contexts`, `tools_called`, `completion_time`, `token_cost`, etc.
- Review metric results, including `scores`, `reasoning`, `thresholds`, and `error` (if any)
- Debug by assessing verbose logs for metrics
- Filter & search for test cases by name, metric status, metric scores, and passing status
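As a rough illustration, the sketch below shows how these fields might be populated on a test case so they appear in the Test Cases page; the tool name, timing, and cost values are assumptions for demonstration only.

```python
# Sketch of a test case carrying the fields surfaced in the Test Cases page;
# all values below are illustrative assumptions.
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Summarize my last invoice.",
    actual_output="Your last invoice totals $42.10, issued on March 3rd.",
    retrieval_context=["Invoice #1042: $42.10, issued 2024-03-03."],
    tools_called=[ToolCall(name="get_invoice")],  # tools your app invoked
    completion_time=1.2,                          # seconds to generate the output
    token_cost=0.0004,                            # cost of the LLM call
)
```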
Dataset Creation
Create new datasets from test cases in your test run by going to the Test Cases page and clicking Save as new dataset. The `actual_output`s will not be included in the goldens of your newly created dataset, since storing them there goes against best practices on Confident AI.
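As a minimal sketch, assuming the new dataset was saved under the hypothetical alias "My Test Run Dataset", you could pull it back for future evaluations like this:

```python
# Sketch: pulling a dataset created from a test run; the alias is hypothetical.
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Test Run Dataset")

# Goldens contain inputs (and optionally expected outputs) but no actual_outputs,
# so outputs should be regenerated by your LLM application at evaluation time.
for golden in dataset.goldens:
    print(golden.input)
```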
Report Sharing
Share and collaborate on testing reports with your team, both on the Overview and Test Cases page:
- Internally shareable links for test runs
- Publicly shareable links for test runs
- Export reports in CSV or JSON format
Hyperparameter Tracking
Monitor and analyze the prompt, model, and other configurations associated with your testing results (see the sketch after this list):
- Record model versions
- Log the prompt versions used, either as text or as messages
- Track additional parameters as freeform key-value pairs (e.g. for embedding models)
- Filter & search for test runs associated with certain hyperparameters
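A sketch of how hyperparameters might be logged alongside an evaluation is shown below; the key names and values are illustrative assumptions, and it assumes `evaluate()` accepts a `hyperparameters` argument.

```python
# Sketch: logging hyperparameters with a test run so the resulting testing
# report can be filtered and searched by them. Keys and values are illustrative.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

evaluate(
    test_cases=[LLMTestCase(input="Hi", actual_output="Hello! How can I help?")],
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={
        "model": "gpt-4o",
        "prompt version": "v2-system-prompt",
        "embedding model": "text-embedding-3-small",
    },
)
```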
Test Run Comparison
Compare different iterations of your LLM app in the Compare Test Results page:
- Side-by-side comparison of test runs, including metric score distributions and the percentage of passing test cases
- Highlight comparable test cases that have regressed or improved in performance
- Filter for regression or improvement only test cases
- Share comparison insights via links
This feature in a testing report is used for A/B regression testing of different iterations of your LLM app. More information on this feature can be found in the next section.