Online Evaluation
You can run online evaluations in production, on-the-fly, by running metrics on individual:
- Spans
- Traces, and
- Threads
You can do this by providing the name of the metric collection you’ve created on Confident AI in the @observe decorators during tracing.
Create Metric Collection
To enable referenceless metrics to run in production, you will need to create a metric collection.
If you’re planning to run online evals on threads, create a multi-turn metric collection; for spans and traces, create a single-turn one. This is really important because the metric collection names need to match, as you’ll see in the next section.
Only the referenceless metrics you’ve enabled inside the metric collection are run upon tracing; Confident AI will simply ignore any non-referenceless metrics.
To recap the referenceless metrics section: referenceless metrics are a special type of metric that can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools).
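For instance, here is a minimal sketch of a test case that a referenceless metric can evaluate; only runtime data such as input and actual_output is required (the values below are purely illustrative):
Python
from deepeval.test_case import LLMTestCase

# Referenceless metrics only need data available at runtime
test_case = LLMTestCase(
    input="Write me a poem.",          # what the user asked
    actual_output="Roses are red...",  # what your LLM app returned
    # no expected_output or expected_tools needed
)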
Online Evals for Spans
Simply provide the name of the metric collection in the @observe decorator to tell Confident AI the specific set of referenceless metrics you wish to run.
You’ll also need to use update_current_span with an LLMTestCase at runtime to actually trigger an online evaluation on the server-side:
Python
import openai

from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metricCollection="My Collection")
def llm_app(query: str) -> str:
    # Call your LLM of choice
    res = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message["content"]

    # Providing an LLMTestCase here triggers the online evaluation for this span
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res

llm_app("Write me a poem.")
The metricCollection argument is an optional string that determines which metrics in your online metric collection will be run for the current span.
Supplying a metric collection name that doesn’t exist or isn’t activated on Confident AI will result in it failing silently. If metrics aren’t showing up on the platform, make sure the names align perfectly. (PS. Watch out for trailing spaces!)
If you specify a metricCollection but don’t update your current span with sufficient test case parameters for metric execution, it will simply show up as an error on Confident AI, and won’t block or cause issues in your code.
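As a sketch (hypothetical scenario: assume a metric in "My Collection" also requires retrieval_context), the online eval below would surface as an error on Confident AI while the function itself still runs normally:
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metricCollection="My Collection")
def incomplete_llm_app(query: str) -> str:
    res = "..."  # your LLM call here
    # retrieval_context is never supplied, so any metric that needs it shows
    # up as an error on Confident AI, but this function won't raise or block
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res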
Online Evals for Traces
Running evals on traces is akin to running end-to-end evals, where you disregard the performance of individual spans within the trace and treat your application as a black-box.
You can run online evals on both traces and spans at the same time.
Similar to evals for spans, you would also provide a metricCollection name, but this time call the update_current_trace() function instead:
Python
import openai

from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_trace

@observe(metricCollection="My Collection")
def llm_app(query: str) -> str:
    # Call your LLM of choice
    res = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message["content"]

    # Providing an LLMTestCase here triggers the online evaluation for this trace
    update_current_trace(test_case=LLMTestCase(input=query, actual_output=res))
    return res

llm_app("Write me a poem.")
Also note that unlike evals on spans, the metricCollection MUST BE DEFINED at the top-level/root span level. You can call update_current_trace anywhere in your observed application though.
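For example, here is a minimal sketch (the generate helper is hypothetical) with metricCollection on the root span and update_current_trace called from a nested span:
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_trace

# metricCollection for trace-level evals goes on the ROOT span
@observe(metricCollection="My Collection")
def llm_app(query: str) -> str:
    return generate(query)

@observe()
def generate(query: str) -> str:
    res = "..."  # your LLM call here
    # ...but update_current_trace() can be called from any nested span
    update_current_trace(test_case=LLMTestCase(input=query, actual_output=res))
    return res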
Online Evals for Threads
Similar to traces and spans, Confident AI will only run online evals on threads once a multi-turn interaction has completed. However, since it is impossible for Confident AI to automatically know whether a multi-turn conversation has completed or not, you’ll have to trigger an online evaluation using the evaluate_thread() method only once you’re certain the conversation has completed:
Python
import openai
from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe()
def llm_app(query: str):
    # Call your LLM of choice
    res = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message["content"]

    # The trace's input/output become the "user"/"assistant" turns of the thread
    update_current_trace(thread_id=your_thread_id, input=query, output=res)
    return res

if __name__ == "__main__":
    user_query = input("💬 Enter your prompt: ")
    response = llm_app(user_query)
    print(f"\n🤖 Response:\n{response}")

    # Only trigger thread evals once the conversation is complete
    evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")
You can also use a_evaluate_thread, the async version of evaluate_thread():
...
await a_evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")
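For instance, a minimal sketch of awaiting it inside an async entrypoint (assuming a_evaluate_thread is importable from deepeval.tracing like its synchronous counterpart):
Python
import asyncio

from deepeval.tracing import a_evaluate_thread

your_thread_id = "your-thread-id"

async def main():
    # ... run your multi-turn llm_app calls here ...
    # Await the async evaluation once the conversation is complete
    await a_evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")

asyncio.run(main())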
You MUST set the input/output of individual traces in a thread for multi-turn evaluation to work online. To recap, DeepEval uses the input of a trace as the "user" role content and the output of a trace as the "assistant" role content as turns in your thread. If you don’t set the input and/or output, Confident AI will have nothing in your thread to evaluate.
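For example, continuing from the llm_app defined above, each call produces one trace, and every trace with the same thread_id contributes a "user" turn and an "assistant" turn to the thread (the queries below are illustrative):
Python
# Each call creates one trace; its input becomes a "user" turn and its
# output becomes an "assistant" turn in the same thread
llm_app("What's the weather like?")
llm_app("And what about tomorrow?")

# Only evaluate once the conversation is over
evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")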
Examples
Trace and span evals
Quick quiz: Given the code below, will Confident AI run online evaluations on the trace using metrics in "Collection 2" or "Collection 1"?
from deepeval.tracing import observe, update_current_span, update_current_trace

@observe(metricCollection="Collection 1")
def outer_function():
    @observe(metricCollection="Collection 2")
    def inner_function():
        update_current_span(test_case=...)
        update_current_trace(test_case=...)

    inner_function()
Answer: This will run "Collection 1" for traces, and "Collection 2" for spans.
This is because in this example:
- When outer_function starts, it creates the “outer” span
- When inner_function is called, it creates the “inner” span on top
- Any calls to update_current_span() during inner_function’s execution will update the “inner” span, not the “outer” one
- Any calls to update_current_trace() at any point inside outer_function will update the entire trace, and online evals for traces MUST BE SET on the root-level span
Thread and trace evals
Quick quiz: Given the code below, will Confident AI run online evaluations on the thread using metrics in "Collection 2" or "Collection 1"?
from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe(metricCollection="Collection 1")
def outer_function():
    update_current_trace(thread_id=your_thread_id, test_case=...)

evaluate_thread(thread_id=your_thread_id, metricCollection="Collection 2")
Answer: This will NOT run "Collection 1" or "Collection 2", because neither the input nor the output has been specified in update_current_trace. This means Confident AI will have no turns to evaluate using metrics in your metric collection.
Note that setting the test_case for a trace has no bearing on the input and output.
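A corrected sketch (the query and response values are placeholders, and it assumes update_current_trace accepts thread_id, input, output, and test_case together, as its separate uses above suggest) would set input and output on the trace so the thread has turns for "Collection 2" to evaluate:
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe(metricCollection="Collection 1")
def outer_function(query: str) -> str:
    res = "..."  # your LLM call here
    update_current_trace(
        thread_id=your_thread_id,
        input=query,   # becomes the "user" turn in the thread
        output=res,    # becomes the "assistant" turn in the thread
        test_case=LLMTestCase(input=query, actual_output=res),  # for trace evals
    )
    return res

outer_function("Write me a poem.")
# Now the thread has turns for "Collection 2" to evaluate
evaluate_thread(thread_id=your_thread_id, metricCollection="Collection 2")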