
Online Evaluation

You can define online metrics at the span level by providing the names of the metrics enabled for monitoring on Confident AI in the @observe decorator.

Enable Metric Collection for Monitoring

To enable referenceless metrics to run in production, you will need to create a metric collection and press the Enable for Monitoring button for that metric collection.

This makes every referenceless metric you’ve enabled inside the metric collection, including custom metrics, runnable as soon as Confident AI receives the trace you’ve logged. Non-referenceless metrics in the collection are simply ignored.

Specify Metrics and Create Test Case

Specify which metrics inside your online metric collection should be run by supplying them as an argument to the @observe decorator. Don’t forget to also update the current span with a test case at runtime to actually run an online evaluation:

import openai

from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metrics=["Answer Relevancy"])
def llm_app(query: str) -> str:
    res = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message["content"]

    update_current_span(
        test_case=LLMTestCase(input=query, actual_output=res)
    )
    return res

llm_app("Write me a poem.")

The metrics argument is an optional list of strings that determines which metrics in your online metric collection will be run for the current span. The test case parameters, on the other hand, map 1-1 to the parameters of an LLMTestCase. You should definitely read this if you’re not sure what an LLMTestCase is.
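For metrics that need more than input and actual_output (retrieval-based metrics, for example), populate the extra LLMTestCase parameters when you update the span. A minimal sketch, assuming a metric named "Faithfulness" exists in your online metric collection and that the retriever and LLM calls are stubbed out:

from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metrics=["Faithfulness"])  # placeholder name -- must match your collection exactly
def rag_app(query: str) -> str:
    retrieved_chunks = ["..."]  # output of your retriever (stubbed)
    answer = "..."              # output of your LLM (stubbed)
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=answer,
            retrieval_context=retrieved_chunks,  # extra parameter some metrics require
        )
    )
    return answer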

💡

Supplying a metric name in metrics that doesn’t exist or isn’t activated on Confident AI will result in it failing silently. If metrics aren’t showing up on the platform, make sure the names align perfectly. (PS. Watch out for trailing spaces!)

Not setting the correct test case parameters isn’t the end of the world. If you specify an enabled online metric in the metrics list but don’t update your current span with sufficient test case parameters for metric execution, it will simply show up as an error on Confident AI.

Advanced Example

When you have nested spans (one function decorated with @observe calling another decorated function), the system always works with the most recently started span. Think of it like a stack of plates - you always work with the top plate.

from deepeval.tracing import observe, update_current_span

@observe(type="custom", name="outer")
def outer_function():
    @observe(type="llm", name="inner", metrics=["Your metric name"])
    def inner_function():
        # Here, update_current_span() will update the LLM span
        update_current_span(test_case=...)

    inner_function()

In this example:

  1. When outer_function starts, it creates the “outer” span
  2. When inner_function is called, it creates the “inner” span on top
  3. Any calls to update_current_span() during inner_function’s execution will update the “inner” span, not the “outer” one
  4. This ensures that metrics and attributes are always applied to the correct, most specific span in your trace

Therefore, it is inner_function() that will have the online metrics evaluated, not outer_function().
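If you also want the outer span evaluated, it can carry its own metrics list and be updated from its own function body once the inner call has returned and its span has been popped. A hedged sketch, where the metric names are placeholders and it is assumed a custom span can take a metrics list too:

from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(type="custom", name="outer", metrics=["Outer metric name"])
def outer_function(query: str) -> str:
    @observe(type="llm", name="inner", metrics=["Inner metric name"])
    def inner_function() -> str:
        answer = "..."  # your LLM call (stubbed)
        # Top of the stack here is the "inner" span
        update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
        return answer

    answer = inner_function()
    # The "inner" span has been popped, so this updates the "outer" span
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer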

Recap on Referenceless Metrics

As we discussed in the referenceless metrics section, referenceless metrics are a special type of metric that can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools). This makes them particularly valuable for production monitoring where you typically don’t have access to ground truth data.

In production, we call these referenceless metrics “online metrics” because they run in real-time as your application processes requests. The key advantages of using referenceless metrics in production are:

  • Real-time monitoring: Evaluate your LLM’s performance as it processes actual user requests
  • No reference data needed: Works without requiring annotated datasets or ground truth data (see the sketch after this list)
  • Immediate feedback: Get instant insights into your application’s performance
  • Scalable evaluation: Can handle high volumes of requests without manual annotation
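
To make the “no reference data needed” point concrete, here is a minimal local sketch using deepeval’s AnswerRelevancyMetric, a referenceless metric that scores a test case from just the input and actual output. The metric choice and threshold are illustrative and separate from the online-metrics flow above:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Only what is available at request time -- no expected_output or ground truth.
test_case = LLMTestCase(
    input="Write me a poem.",
    actual_output="Roses are red, violets are blue...",
)

metric = AnswerRelevancyMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)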