Test Cases
A test case represents a single interaction or conversation with your LLM application. There are two main types of test cases:
- Single-turn Test Cases (`LLMTestCase`): Represent a single back-and-forth interaction with an LLM
- Multi-turn Test Cases (`ConversationalTestCase`): Represent an ongoing conversation with multiple interactions
For single-turn test cases, a common example is a RAG QA system answering questions using an internal knowledge base. For multi-turn test cases, a typical example would be an AI copilot having an extended conversation with a user.
Why Test Cases?
When you format an interaction with your LLM application into a test case, Confident AI can evaluate its performance using LLM evaluation metrics. Different metrics require different test case parameters (e.g., a RAG metric like faithfulness requires `retrieval_context`), so it’s important to visit each metric’s documentation page to see what your test case must provide.
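For example, here is a minimal sketch of a test case that supplies `retrieval_context` for DeepEval’s `FaithfulnessMetric` (the input, output, and context strings are purely illustrative):

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric
from deepeval import evaluate

# FaithfulnessMetric is a RAG metric, so the test case needs
# retrieval_context in addition to input and actual_output.
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a full refund within 30 days of purchase.",
    retrieval_context=[
        "All customers are eligible for a full refund within 30 days of purchase."
    ],
)

metric = FaithfulnessMetric()
evaluate(test_cases=[test_case], metrics=[metric])
```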
Test Case Implementation
The encouraged way to run evaluations is locally using DeepEval, which integrates automatically with Confident AI. In DeepEval:
- Single-turn test cases are implemented as `LLMTestCase`s
- Multi-turn test cases are implemented as `ConversationalTestCase`s, which contain multiple `LLMTestCase`s representing each turn in the conversation
Here’s how you would define an `LLMTestCase` to evaluate a single-turn LLM application:
```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

test_case = LLMTestCase(
    input="Can you write me a poem?",
    # Replace with your LLM app's output
    actual_output="Sure! Here is a poem about..."
)

metric = AnswerRelevancyMetric()
evaluate(test_cases=[test_case], metrics=[metric])
```
Similarly, here’s how you would define a `ConversationalTestCase` to evaluate a multi-turn LLM application:
```python
from deepeval.test_case import LLMTestCase, ConversationalTestCase
from deepeval.metrics import ConversationRelevancyMetric
from deepeval import evaluate

test_case = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="Hi! Who are you?",
            actual_output="Ho ho! I'm a jolly wizard who loves casting spells and telling magical jokes! What can I do for you today, my friend?"
        ),
        LLMTestCase(
            input="Can you tell me a joke about magic?",
            actual_output="*waves wand excitedly* Why don't wizards like to argue? Because they don't want to get into a spell-ing contest! *chuckles and sparkles shoot from wand* Get it? Spell-ing? Oh, I do love a good magical pun!"
        ),
    ]
)

metric = ConversationRelevancyMetric()
evaluate(test_cases=[test_case], metrics=[metric])
```
LLMTestCase
An `LLMTestCase` is used to evaluate a single interaction with an LLM application.
It requires at minimum:
- `input` (`str`): The input to the LLM app
- `actual_output` (`str`): The LLM app’s output based on the input
Not every metric requires both the `input` and `actual_output` for evaluation, but they are mandatory arguments for all test cases since we’re encouraging the evaluation of LLM applications.
Optional parameters include:
- [Optional] `expected_output` (`str`): The ideal output based on the input
- [Optional] `retrieval_context` (`list[str]`): Text chunks retrieved from a RAG pipeline
- [Optional] `tools_called` (`list[ToolCall]`): Tools used by the LLM app
- [Optional] `expected_tools` (`list[ToolCall]`): Tools expected to be used
- [Optional] `token_cost` (`float`): Token cost of the entire LLM interaction
- [Optional] `completion_time` (`float`): Time taken for the entire LLM interaction to complete
- [Optional] `context` (`list[str]`): Additional information provided to the LLM app
You’ll notice that `tools_called` and `expected_tools` are lists of `ToolCall` objects. Here is the structure of a `ToolCall`:
```python
from typing import Any, Dict, Optional

from pydantic import BaseModel

# Structure of deepeval.test_case.ToolCall
class ToolCall(BaseModel):
    name: str
    description: Optional[str] = None
    input_parameters: Optional[Dict[str, Any]] = None
    output: Optional[Any] = None
```
More information on `ToolCall` can be found in DeepEval’s documentation.
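To put the optional parameters together, here is a sketch of a fully-populated `LLMTestCase`. Every value, including the `get_weather` tool, is hypothetical and only for illustration:

```python
from deepeval.test_case import LLMTestCase, ToolCall

# All values below are illustrative; get_weather is a hypothetical tool.
test_case = LLMTestCase(
    input="What's the weather in Paris?",
    actual_output="It's currently 18°C and sunny in Paris.",
    expected_output="A current weather report for Paris.",
    retrieval_context=["Paris, France: 18°C, sunny, light breeze."],
    context=["The user is planning a trip to Paris."],
    tools_called=[
        ToolCall(name="get_weather", input_parameters={"city": "Paris"})
    ],
    expected_tools=[ToolCall(name="get_weather")],
    token_cost=0.0021,
    completion_time=1.4,
)
```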
It’s important to note that `actual_output`, `retrieval_context`, and `tools_called` are expected to be dynamic values that change with each LLM interaction. You can of course pre-compute `actual_output`s for a given set of inputs, but that defeats the purpose of running evals every time you make an improvement to your LLM app.
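A minimal sketch of this pattern, assuming a hypothetical `my_llm_app` function that wraps your actual application:

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

# my_llm_app is a hypothetical stand-in for your LLM application;
# calling it at evaluation time keeps actual_output dynamic.
def my_llm_app(user_input: str) -> str:
    # Replace with a call to your model, agent, or RAG pipeline
    return "..."

inputs = ["Can you write me a poem?", "What is your refund policy?"]
test_cases = [
    LLMTestCase(input=i, actual_output=my_llm_app(i)) for i in inputs
]

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```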
ConversationalTestCase
A `ConversationalTestCase` represents a multi-turn conversation with an LLM.
It requires:
- `turns` (`list[LLMTestCase]`): A list of `LLMTestCase`s representing each interaction in the conversation
And optionally:
- [Optional] `chatbot_role` (`str`): The role/persona of the LLM chatbot, only required for the `RoleAdherenceMetric`
```python
from deepeval.test_case import LLMTestCase, ConversationalTestCase

test_case = ConversationalTestCase(
    chatbot_role="You are a happy jolly wizard that likes to cast spells and tell jokes",
    turns=[
        LLMTestCase(
            input="Hi! Who are you?",
            actual_output="Ho ho! I'm a jolly wizard who loves casting spells and telling magical jokes! What can I do for you today, my friend?"
        ),
        LLMTestCase(
            input="Can you tell me a joke about magic?",
            actual_output="*waves wand excitedly* Why don't wizards like to argue? Because they don't want to get into a spell-ing contest! *chuckles and sparkles shoot from wand* Get it? Spell-ing? Oh, I do love a good magical pun!"
        ),
    ]
)
```
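Since `chatbot_role` is only read by the `RoleAdherenceMetric`, a sketch of evaluating the test case above against it might look like:

```python
from deepeval.metrics import RoleAdherenceMetric
from deepeval import evaluate

# RoleAdherenceMetric checks each turn against the chatbot_role
# defined on the test case above.
metric = RoleAdherenceMetric()
evaluate(test_cases=[test_case], metrics=[metric])
```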
Further Reading
A more detailed breakdown of each individual test case parameter can be found in DeepEval’s documentation.