Support For All Use Cases
Confident AI is designed to evaluate any type of LLM application, from simple chatbots to complex agentic systems. Each use case has its own unique evaluation requirements, and we provide specialized metrics and features to help you get the most accurate assessment of your LLM’s performance.
A use case is something like:
- RAG QA: Systems that combine document retrieval with LLM generation to provide accurate, source-based answers.
- Chatbots: Conversational, multi-turn AI systems designed to engage in natural dialogues with users.
- Writing Assistants: AI tools that help users improve their writing by providing suggestions, corrections, and enhancements.
- Summarization: Systems that condense longer documents into shorter, coherent versions while preserving key information.
- Autonomous Agents: AI systems that can independently perform complex tasks by breaking them down into manageable steps.
- Text-SQL: Systems that convert natural language queries into SQL database queries.
- Code Generation: Systems that create executable code from natural language descriptions.
A use case can be built using different systems. You’ll notice a clear pattern in how different systems are evaluated:
- Simpler systems (like summarization and writing assistants) focus more on use case-specific custom metrics that evaluate output quality
- Complex systems (like code generation and autonomous agents) require both system metrics and reference-based evaluation against golden `expected_output`s, along with tracing for debugging
It is recommended that you allocate one project space per use case on Confident AI.
RAG QA
RAG (Retrieval-Augmented Generation) QA systems combine document retrieval with LLM generation to provide accurate, source-based answers. They first retrieve relevant documents based on a query, then use those documents as context for the LLM to generate an informed response.
- A medical knowledge base that helps doctors quickly find relevant research and treatment guidelines
- A legal research assistant that helps lawyers search through case law and generate summaries
- A product documentation retriever that finds relevant documentation sections to answer customer queries
Let’s explore how to evaluate a medical knowledge base that helps doctors find relevant research and treatment guidelines.
Metrics
For our medical knowledge base example, we’ll want to include a mix of system-specific and use case-specific metrics. RAG QA is a balanced use case that requires both strong system performance and domain-specific evaluation. For a RAG QA system, we recommend:
- Answer Relevancy (generic RAG): How well the answer addresses the query
- Faithfulness (generic RAG): Whether the answer is supported by the retrieved context
- Contextual Relevancy (generic RAG): How well the retrieved documents match the query
- Clinical Relevance (custom G-Eval): How well the answer applies to clinical practice
In this example, the query WITHOUT THE PROMPT TEMPLATE is the `input` to a test case, while the answer is the `actual_output`, and any medical documents retrieved to generate the answer are the `retrieval_context`.
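Here's a minimal sketch of how such a test case and metrics could be put together with deepeval. The query, answer, context, and G-Eval criteria wording below are illustrative assumptions, not prescriptions:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    GEval,
)

# The query (without the prompt template) is the input, the generated answer
# is the actual_output, and retrieved guideline excerpts form the
# retrieval_context. All values are made up for illustration.
test_case = LLMTestCase(
    input="What is the first-line treatment for stage 1 hypertension?",
    actual_output="Lifestyle changes plus a thiazide diuretic are typically recommended...",
    retrieval_context=[
        "Guideline excerpt: For stage 1 hypertension, initiate lifestyle modification...",
    ],
)

# Use case-specific metric defined with G-Eval; the criteria wording is an assumption.
clinical_relevance = GEval(
    name="Clinical Relevance",
    criteria="Assess how well the answer applies to real clinical practice for the given query.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(
    test_cases=[test_case],
    metrics=[
        AnswerRelevancyMetric(),
        FaithfulnessMetric(),
        ContextualRelevancyMetric(),
        clinical_relevance,
    ],
)
```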
Click here to learn how to implement and use these metrics for evaluation.
The prompt template should instead be logged as a hyperparameter during evaluation.
Chatbots
Chatbots are conversational AI systems designed to engage in natural, multi-turn dialogues with users. They can handle various tasks from customer service to information retrieval while maintaining context throughout the conversation.
- A customer support chatbot that helps customers find products and make purchases
- A patient triage system that helps healthcare providers assess symptoms and schedule appointments
- A banking assistant that helps customers check balances and make transactions
We’ll demonstrate how to evaluate a customer support chatbot that helps customers find products and make purchases.
Metrics
For our customer support chatbot example, we'll focus on both RAG and conversational aspects. Because this chatbot combines retrieval with multi-turn dialogue, we'll use a mix of generic RAG and conversational metrics:
- Contextual Recall (generic RAG): How well the chatbot retrieves the relevant product information
- Role Adherence (generic conversational): How well the chatbot maintains its helpful, customer-focused persona
- Purchase Intent Support (custom G-Eval): How well the chatbot guides customers toward making a purchase decision
In this example, the customer query is the `input` to an `LLMTestCase` in the `turns` of a `ConversationalTestCase`, while the chatbot's response is the `actual_output`, and any product documentation retrieved to generate the answer is the `retrieval_context`. The system prompt defining the chatbot's role and personality should be provided to the `chatbot_role` parameter.
You can learn what a `ConversationalTestCase` is here.
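As a rough sketch, the turns described above might be packaged as follows. Note that the exact turn format differs between deepeval versions (older versions accept `LLMTestCase`s as turns, newer ones use `Turn` objects), and all product data here is fictional:

```python
from deepeval.test_case import LLMTestCase, ConversationalTestCase
from deepeval.metrics import RoleAdherenceMetric

# Each turn maps the customer query to input, the chatbot reply to
# actual_output, and retrieved product docs to retrieval_context.
turns = [
    LLMTestCase(
        input="Do you have waterproof hiking boots under $150?",
        actual_output="Yes! The TrailPro 2 is $129 and fully waterproof. Want me to add it to your cart?",
        retrieval_context=[
            "TrailPro 2: waterproof hiking boot, $129, sizes 6-13.",
        ],
    ),
]

# The system prompt defining the chatbot's persona goes into chatbot_role.
convo_test_case = ConversationalTestCase(
    chatbot_role="A friendly customer support agent that helps shoppers find and purchase products.",
    turns=turns,
)

# Conversational metrics like Role Adherence run on the whole conversation.
role_adherence = RoleAdherenceMetric()
role_adherence.measure(convo_test_case)
print(role_adherence.score, role_adherence.reason)
```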
Click here to learn how to implement and use these metrics for evaluation.
Writing Assistants
Writing assistants are AI tools that help users improve their writing by providing suggestions, corrections, and enhancements. They can help with grammar, style, tone, and overall content quality while maintaining the user’s voice.
- A marketing writer that helps create engaging social media posts and content
- An academic writing assistant that helps students improve essays and research papers
- A technical documentation generator that creates clear API descriptions
Here’s how to evaluate a marketing writer that helps create engaging social media posts and content.
Metrics
For our marketing writer example, we'll focus on content quality and formatting. While our guide suggests using 1-2 custom metrics and 2-3 generic metrics, this writing assistant is relatively simple, with little system complexity beyond its formatting tools. The evaluation is therefore primarily use case-specific, focusing on output quality rather than system behavior.
Since the requirements of the use case matter more than the system itself, we'll focus on three key metrics:
- Tool Correctness (generic agentic): How accurately the formatting tools are applied
- Format Correctness (custom DAG): How well the writing meets the specified formatting requirements
- Brand Voice Alignment (custom G-Eval): How well the content matches the brand’s tone and messaging
In this example, the original text and requirements are the `input` to a test case, while the improved text is the `actual_output`. The style guides and formatting rules are part of the system's configuration, not `retrieval_context`; they should most likely go in the system prompt and be logged as a hyperparameter.
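Below is a hedged sketch of the custom Brand Voice Alignment metric defined with G-Eval. The Tool Correctness and DAG-based Format Correctness metrics are omitted here since they need tool-call records and a DAG definition; the input, output, and criteria wording are illustrative assumptions:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# The draft text plus the user's requirements form the input, and the
# improved text is the actual_output. Values are made up for illustration.
test_case = LLMTestCase(
    input=(
        "Rewrite for LinkedIn, upbeat but professional: "
        "'Our new app update is out, it has some bug fixes.'"
    ),
    actual_output="We just shipped a smoother, faster app experience. Update now to see the difference!",
)

# Custom brand-voice metric defined with G-Eval; the criteria wording is an assumption.
brand_voice = GEval(
    name="Brand Voice Alignment",
    criteria="Evaluate whether the rewritten content matches an upbeat, professional brand tone and preserves the original message.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[test_case], metrics=[brand_voice])
```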
Click here to learn how to implement and use these metrics for evaluation.
While this use case references external context like style guides, it doesn’t require testing as a RAG pipeline. RAG pipeline testing is most valuable when the retrieval process itself could be imperfect or needs optimization.
Summarization
Text summarization systems condense longer documents into shorter, coherent versions while preserving key information. They can be extractive (pulling out important sentences) or abstractive (generating new text that captures the essence).
- A meeting assistant that generates action items and key points from transcripts
- A research tool that helps scientists quickly understand new papers in their field
- A news aggregator that creates concise summaries of daily news articles
Let’s look at how to evaluate a meeting assistant that generates action items and key points from transcripts.
Metrics
For our meeting assistant example, we'll focus on summary quality and accuracy. Like the writing assistant, summarization is primarily about output quality rather than complex system interactions, and we assume the system already has access to the original transcript, so no retrieval is needed. We'll focus on these key metrics:
- Faithfulness (generic RAG): Whether the summary stays faithful to the original text without hallucinating
- Format Correctness (custom DAG): How well the summary follows the required structure (e.g., bullet points, sections)
- Conciseness (custom G-Eval): How well the summary captures key information without unnecessary details
In this example, the original text is the `input` to a test case, while the summary is the `actual_output`. The prompt that gives the summarization instructions should be provided in the system prompt and logged as a hyperparameter.
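Here's a minimal sketch under these assumptions. One way to apply the Faithfulness metric here is to pass the original transcript as the `retrieval_context`, so the summary can be checked against it; the transcript, summary, and criteria wording are illustrative:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import FaithfulnessMetric, GEval

transcript = "Alice: Let's ship v2 on Friday. Bob: I'll finish the billing fix by Thursday..."

# The transcript is the input and the generated summary is the actual_output.
# Passing the transcript as retrieval_context lets Faithfulness check the
# summary against it (an assumption of this sketch).
test_case = LLMTestCase(
    input=transcript,
    actual_output="- Ship v2 on Friday\n- Bob to finish the billing fix by Thursday",
    retrieval_context=[transcript],
)

# Stand-in for the custom Conciseness metric, defined with G-Eval.
conciseness = GEval(
    name="Conciseness",
    criteria="Check that the summary captures the key decisions and action items without unnecessary detail.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

evaluate(test_cases=[test_case], metrics=[FaithfulnessMetric(), conciseness])
```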
Click here to learn how to implement and use these metrics for evaluation.
Autonomous Agents
Autonomous agents are AI systems that can independently perform complex tasks by breaking them down into manageable steps. They can use tools, make decisions, and adapt their approach based on feedback and changing conditions.
- A travel planner that creates personalized itineraries and books accommodations
- A browser agent that automates web tasks like sending emails and filling forms
- A trading bot that manages investment portfolios and executes trades
- A logistics manager that coordinates supply chain operations
We’ll walk through how to evaluate a travel planner that creates personalized itineraries and books accommodations.
Metrics
For our travel planner example, we’ll focus on system execution. Unlike simpler use cases, autonomous agents are system-heavy with complex execution flows. For the travel planner, we’ll focus on the core agent execution metrics rather than travel-specific outcomes:
- Tool Correctness (generic agentic): How accurately the agent uses tools like search, booking, and calendar APIs
- Task Completion (generic agentic): How successfully the agent completes the full travel planning workflow
In this example, the user's travel requirements are the `input` to a test case, while the final itinerary and bookings are the `actual_output`.
Click here to learn how to implement and use these metrics for evaluation.
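As a rough sketch, the tool-use side of this evaluation might look like the following. The tool names and values are hypothetical, and Task Completion (which typically relies on traces of the agent's execution) is omitted:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

# The travel requirements are the input and the final itinerary is the
# actual_output. tools_called records what the agent actually invoked, while
# expected_tools is the golden tool sequence; all names here are hypothetical.
test_case = LLMTestCase(
    input="Plan a 3-day trip to Lisbon in May under $1,500, and book a hotel near the old town.",
    actual_output="Day 1: Alfama walking tour... Hotel: Casa do Castelo booked for May 9-12.",
    tools_called=[ToolCall(name="search_flights"), ToolCall(name="book_hotel")],
    expected_tools=[
        ToolCall(name="search_flights"),
        ToolCall(name="search_hotels"),
        ToolCall(name="book_hotel"),
    ],
)

evaluate(test_cases=[test_case], metrics=[ToolCorrectnessMetric()])
```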
For autonomous agents, setting up tracing is highly recommended. Tracing allows you to debug nested components in your agent that might not be performing as expected.
Text-SQL
Text-SQL systems convert natural language queries into SQL database queries, allowing users to interact with databases using everyday language. They understand database schemas and can generate complex SQL queries that accurately reflect user intentions.
- A business intelligence tool that lets data analysts query sales data
- A research database that allows scientists to query experimental results
Let’s see how to evaluate a business intelligence tool that lets data analysts query sales data.
Metrics
For our business intelligence tool example, we’ll focus on SQL generation quality. Text-SQL systems usually operate as RAG systems, where the first step is retrieving relevant schema information from potentially large database structures. The generation phase then focuses on SQL correctness rather than natural language quality:
- Contextual Relevancy (generic RAG): How well the retrieved schema matches the query intent
- Faithfulness (generic RAG): Whether the generated SQL is supported by the retrieved schema
- SQL Correctness (custom DAG): How well the generated SQL follows syntax rules and best practices
In this example, the natural language query is the `input` to a test case, while the generated SQL is the `actual_output`, and the retrieved schema information is the `retrieval_context`. Tracing is also extremely helpful here to visualize the retrieved tables and SQL execution times.
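Here's a hedged sketch of how this might look. The schema summaries, query, and SQL are illustrative, and the custom DAG-based SQL Correctness metric is approximated with G-Eval for brevity:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ContextualRelevancyMetric, FaithfulnessMetric, GEval

# The analyst's question is the input, the generated SQL is the actual_output,
# and the retrieved schema summaries are the retrieval_context.
test_case = LLMTestCase(
    input="What was total revenue per product category last quarter?",
    actual_output=(
        "SELECT c.category_name, SUM(s.price * s.quantity) AS revenue "
        "FROM sales s JOIN categories c ON s.category_id = c.id "
        "WHERE s.sale_date >= DATE '2025-04-01' GROUP BY c.category_name;"
    ),
    retrieval_context=[
        "sales: daily sales records with product_id, category_id, quantity, price, sale_date.",
        "categories: id, category_name.",
    ],
)

# Stand-in for the custom DAG-based SQL Correctness metric; criteria wording is an assumption.
sql_correctness = GEval(
    name="SQL Correctness",
    criteria="Check that the generated SQL is syntactically valid and correctly answers the question given the retrieved schema.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)

evaluate(
    test_cases=[test_case],
    metrics=[ContextualRelevancyMetric(), FaithfulnessMetric(), sql_correctness],
)
```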
Database tables are typically indexed by condensed summaries of their structure and content. For example, a “sales” table might be indexed as “Contains daily sales records with columns for product_id, quantity, price, and customer_id. Used for tracking revenue and inventory.” This allows the system to quickly retrieve relevant tables based on the query intent.
Click here to learn how to implement and use these metrics for evaluation.
Code Generation
Code generation systems create executable code from natural language descriptions of what the code should do. They understand programming languages, best practices, and can generate well-documented, maintainable code that meets specified requirements.
- A frontend UI generator that creates frontend components and API endpoints
- A code generation tool in VS Code that helps developers create basic application features
Here’s how to evaluate a frontend UI generator that creates frontend components and API endpoints.
Metrics
For our frontend UI generator example, we'll focus on both system execution and code quality. The most complex use case of all, code generation is system-heavy and requires careful evaluation of both the agent's execution and the quality of the generated code. We'll focus on:
- Task Completion (generic agentic): How successfully the agent completes the full code generation workflow
- Code Correctness (custom DAG): Whether the generated code runs without errors
- Code Quality (custom G-Eval): How well the generated code compares to ideal, production-ready code
In this example, the natural language requirements are the `input` to a test case, while the generated code is the `actual_output`. You'll definitely want tracing for this use case.
For code generation, it’s undoubtedly complex and requires expected_output
s
to function well. For a code generation tool like GitHub or Cursor, you’ll
also want to include contextual recall to make sure that your agent is able to
retrieve the relevant code files to generate the ideal piece of code.
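A hedged sketch under these assumptions follows, showing a test case with a golden `expected_output` and a contextual recall check. Task Completion (trace-based) and the DAG-based Code Correctness metric are omitted for brevity, and the component code is abbreviated for illustration:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ContextualRecallMetric, GEval

# The requirements are the input, the generated code is the actual_output,
# and a golden, production-ready implementation is the expected_output.
# Retrieved code files form the retrieval_context. Values are illustrative.
test_case = LLMTestCase(
    input="Create a React button component that shows a loading spinner while disabled.",
    actual_output="export function LoadingButton({ loading, children }) { /* ... */ }",
    expected_output="export function LoadingButton({ loading, children, ...props }) { /* ... */ }",
    retrieval_context=["Spinner.tsx: export function Spinner() { /* ... */ }"],
)

# Custom code quality metric defined with G-Eval; criteria wording is an assumption.
code_quality = GEval(
    name="Code Quality",
    criteria="Compare the generated code against the expected, production-ready implementation for correctness, readability, and maintainability.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

evaluate(test_cases=[test_case], metrics=[ContextualRecallMetric(), code_quality])
```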
Click here to learn how to implement and use these metrics for evaluation.
Important note
For some of the use cases, we've listed example metrics that we believe are most appropriate. However, you should carefully evaluate and adapt these metrics for your specific use case, even if your use case looks identical to ours on paper. While our suggested metrics are a good starting point, we made a lot of assumptions about the use case when coming up with them.
Very rarely, some of the metrics require test cases with `expected_output` values. If you don't have a labeled dataset with these expected outputs, you have two options:
- Label your dataset manually (recommended)
- Choose alternative metrics that don’t require labeled data