Online Evaluation
You can define online metrics on a span level by providing the names of the metrics enabled for monitoring on Confident AI in the @observe decorator.
Enable Metric Collection for Monitoring
To enable referenceless metrics to run in production, you will need to create a metric collection and press the Enable for Monitoring button for that metric collection.
This will make every referenceless metric you’ve enabled inside the metric collection, including custom metrics, runnable upon receiving the traces you’ve logged. Confident AI will simply ignore any non-referenceless metrics.
Specify Metrics and Create Test Case
Specify which metrics inside your online metric collection should be run by supplying them as an argument in the @observe decorator. Don’t forget to also update your current span with a test case at runtime to actually run an online evaluation:
Python
from openai import OpenAI
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

client = OpenAI()

@observe(metrics=["Answer Relevancy"])
def llm_app(query: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
    # Update the current span with a test case so the online
    # metrics can actually run on Confident AI
    update_current_span(
        test_case=LLMTestCase(input=query, actual_output=res)
    )
    return res

llm_app("Write me a poem.")
The metrics argument is an optional list of strings that determines which metrics in your online metric collection will be run for this current span. The test case parameters, on the other hand, map one-to-one to the parameters of an LLMTestCase. You should definitely read this if you’re not sure what an LLMTestCase is.
Supplying a metric name in metrics that doesn’t exist or isn’t activated on Confident AI will result in it failing silently. If metrics aren’t showing up on the platform, make sure the names align perfectly. (PS. Watch out for trailing spaces!)
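Since metric names are matched as exact strings, one way to guard against this is to normalize the names before handing them to @observe. A minimal sketch (the metric names here are placeholders):
Python
from deepeval.tracing import observe

# Hypothetical metric names copied from the platform; note the stray
# whitespace, which would otherwise cause a silent mismatch.
metric_names = ["Answer Relevancy ", " Faithfulness"]

# Strip whitespace before passing the names to @observe
@observe(metrics=[name.strip() for name in metric_names])
def llm_app(query: str) -> str:
    ...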
Not setting the correct test case parameters isn’t the end of the world. If you specify an enabled online metric in the metrics list but don’t update your current span with sufficient test case parameters for metric execution, it will simply show up as an error on Confident AI.
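For instance, a metric that judges outputs against retrieved context typically needs retrieval_context to be set on the test case. A sketch, assuming a metric named "Faithfulness" is enabled in your collection (the retrieval and generation steps are placeholders):
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metrics=["Faithfulness"])  # assumed metric name
def rag_app(query: str) -> str:
    # Placeholder retrieval and generation steps
    chunks = ["Some retrieved document chunk."]
    answer = "Some generated answer."
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=answer,
            # A context-grounded metric can’t run without this field;
            # omitting it surfaces as an error on Confident AI
            retrieval_context=chunks,
        )
    )
    return answer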
Advanced Example
If you’re wondering what happens when you have nested spans (one function decorated with @observe calling another decorated function), the system always works with the most recently started span. Think of it like a stack of plates: you always work with the top plate.
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(type="custom", name="outer")
def outer_function():
    @observe(type="llm", name="inner", metrics=["Your metric name"])
    def inner_function():
        # Here, update_current_span() will update the "inner" LLM span
        update_current_span(
            test_case=LLMTestCase(input="...", actual_output="...")
        )
    inner_function()

outer_function()
In this example:
- When outer_function starts, it creates the “outer” span
- When inner_function is called, it creates the “inner” span on top
- Any calls to update_current_span() during inner_function’s execution will update the “inner” span, not the “outer” one
- This ensures that metrics and attributes are always applied to the correct, most specific span in your trace
Therefore, it is the inner_function() that will have the online metrics evaluated, NOT the outer_function().
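Because spans behave like a stack, once inner_function() returns, its span is popped and update_current_span() targets the “outer” span again. A sketch building on the example above (both metric names are placeholders):
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(type="llm", name="inner", metrics=["Your metric name"])
def inner_function():
    # The "inner" span is on top of the stack here
    update_current_span(
        test_case=LLMTestCase(input="...", actual_output="...")
    )

@observe(type="custom", name="outer", metrics=["Another metric name"])
def outer_function():
    inner_function()
    # inner_function's span has ended and been popped, so this
    # call updates the "outer" span instead
    update_current_span(
        test_case=LLMTestCase(input="...", actual_output="...")
    )

outer_function()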
Recap on Referenceless Metrics
As we discussed in the referenceless metrics section, referenceless metrics are a special type of metric that can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools). This makes them particularly valuable for production monitoring, where you typically don’t have access to ground truth data.
In production, we call these referenceless metrics “online metrics” because they run in real time as your application processes requests. The key advantages of using referenceless metrics in production are:
- Real-time monitoring: Evaluate your LLM’s performance as it processes actual user requests
- No reference data needed: Works without requiring annotated datasets or ground truth data
- Immediate feedback: Get instant insights into your application’s performance
- Scalable evaluation: Can handle high volumes of requests without manual annotation