Online Evaluation
You can run online evaluations in production, on-the-fly, by running metrics on individual:
- Spans
- Traces, and
- Threads
You can do this by providing the name of the metric collection you’ve created on Confident AI in the @observe decorators during tracing.
Create Metric Collection
To enable referenceless metrics to run in production, you will need to create a metric collection.
If you’re planning to run online evals on threads, create a multi-turn metric collection; for spans and traces, create a single-turn one. This is really important because the metric collection names need to match, as you’ll see in the next section.
Only the referenceless metrics you’ve enabled inside the metric collection are run upon tracing; Confident AI will simply ignore any non-referenceless metrics.
To recap the referenceless metrics section: referenceless metrics are a special type of metric that can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools).
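For instance, here is a minimal sketch of a test case that a referenceless metric can evaluate; only runtime data such as input and actual_output is required (the values below are purely illustrative):
Python
from deepeval.test_case import LLMTestCase

# Referenceless metrics only need data available at runtime
test_case = LLMTestCase(
    input="Write me a poem.",          # what the user asked
    actual_output="Roses are red...",  # what your LLM app returned
    # no expected_output or expected_tools needed
)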
Online Evals for Spans
Simply provide the name of the metric collection in the @observe decorator to tell Confident AI the specific set of referenceless metrics you wish to run.
You’ll also need to use update_current_span with an LLMTestCase at runtime to actually trigger an online evaluation on the server-side:
Python
import openai

from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metricCollection="My Collection")
def llm_app(query: str) -> str:
    # Call your LLM of choice
    res = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message["content"]

    # Providing an LLMTestCase here triggers the online evaluation for this span
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res

llm_app("Write me a poem.")
The metricCollection argument is an optional string that determines which metrics in your online metric collection will be run for the current span.
Supplying a metric collection name that doesn’t exist or isn’t activated on Confident AI will result in it failing silently. If metrics aren’t showing up on the platform, make sure the names align perfectly. (PS. Watch out for trailing spaces!)
If you specify a metricCollection but don’t update your current span with sufficient test case parameters for metric execution, it will simply show up as an error on Confident AI, and won’t block or cause issues in your code.
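As a sketch (hypothetical scenario: assume a metric in "My Collection" also requires retrieval_context), the online eval below would surface as an error on Confident AI while the function itself still runs normally:
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

@observe(metricCollection="My Collection")
def incomplete_llm_app(query: str) -> str:
    res = "..."  # your LLM call here
    # retrieval_context is never supplied, so any metric that needs it shows
    # up as an error on Confident AI, but this function won't raise or block
    update_current_span(test_case=LLMTestCase(input=query, actual_output=res))
    return res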
Online Evals for Traces
Running evals on traces is akin to running end-to-end evals, where you disregard the performance of individual spans within the trace and treat your application as a black-box.
You can run online evals on both traces and spans at the same time.
Similar to evals for spans, you would also provide a metricCollection name, but this time call the update_current_trace() function instead:
Python
import openai

from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_trace

@observe(metricCollection="My Collection")
def llm_app(query: str) -> str:
    # Call your LLM of choice
    res = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message["content"]

    # Providing an LLMTestCase here triggers the online evaluation for this trace
    update_current_trace(test_case=LLMTestCase(input=query, actual_output=res))
    return res

llm_app("Write me a poem.")
Also note that unlike evals on spans, the metricCollection MUST BE DEFINED at the top-level/root span level. You can call update_current_trace anywhere in your observed application though.
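For example, here is a minimal sketch (the generate helper is hypothetical) with metricCollection on the root span and update_current_trace called from a nested span:
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_trace

# metricCollection for trace-level evals goes on the ROOT span
@observe(metricCollection="My Collection")
def llm_app(query: str) -> str:
    return generate(query)

@observe()
def generate(query: str) -> str:
    res = "..."  # your LLM call here
    # ...but update_current_trace() can be called from any nested span
    update_current_trace(test_case=LLMTestCase(input=query, actual_output=res))
    return res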
Online Evals for Threads
Similar to traces and spans, Confident AI will only run online evals on threads once a multi-turn interaction has completed. However, since it is impossible for Confident AI to automatically know whether a multi-turn conversation has completed or not, you’ll have to trigger an online evaluation using the evaluate_thread() method only once you’re certain the conversation has completed:
Python
import openai
from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe()
def llm_app(query: str):
    # Call your LLM of choice
    res = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message["content"]

    # The trace's input/output become the "user"/"assistant" turns of the thread
    update_current_trace(thread_id=your_thread_id, input=query, output=res)
    return res

if __name__ == "__main__":
    user_query = input("💬 Enter your prompt: ")
    response = llm_app(user_query)
    print(f"\n🤖 Response:\n{response}")

    # Only trigger thread evals once the conversation is complete
    evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")
You can also use a_evaluate_thread, the async version of evaluate_thread():
...
await a_evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")
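For instance, a minimal sketch of awaiting it inside an async entrypoint (assuming a_evaluate_thread is importable from deepeval.tracing like its synchronous counterpart):
Python
import asyncio

from deepeval.tracing import a_evaluate_thread

your_thread_id = "your-thread-id"

async def main():
    # ... run your multi-turn llm_app calls here ...
    # Await the async evaluation once the conversation is complete
    await a_evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")

asyncio.run(main())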
You MUST set the input/output of individual traces in a thread for multi-turn evaluation to work online. To recap, DeepEval uses the input of a trace as the "user" role content and the output of a trace as the "assistant" role content as turns in your thread. If you don’t set the input and/or output, Confident AI will have nothing in your thread to evaluate.
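For example, continuing from the llm_app defined above, each call produces one trace, and every trace with the same thread_id contributes a "user" turn and an "assistant" turn to the thread (the queries below are illustrative):
Python
# Each call creates one trace; its input becomes a "user" turn and its
# output becomes an "assistant" turn in the same thread
llm_app("What's the weather like?")
llm_app("And what about tomorrow?")

# Only evaluate once the conversation is over
evaluate_thread(thread_id=your_thread_id, metricCollection="Collection Name")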
Examples
Trace and span evals
Quick quiz: Given the code below, will Confident AI run online evaluations on the trace using metrics in "Collection 2" or "Collection 1"?
from deepeval.tracing import observe, update_current_span, update_current_trace

@observe(metricCollection="Collection 1")
def outer_function():
    @observe(metricCollection="Collection 2")
    def inner_function():
        update_current_span(test_case=...)
        update_current_trace(test_case=...)

    inner_function()
Answer: This will run "Collection 1" for traces, and "Collection 2" for spans.
This is because in this example:
- When outer_function starts, it creates the “outer” span
- When inner_function is called, it creates the “inner” span on top
- Any calls to update_current_span() during inner_function’s execution will update the “inner” span, not the “outer” one
- Any calls to update_current_trace() at any point inside outer_function will update the entire trace, and online evals for traces MUST BE SET on the root-level span
Thread and trace evals
Quick quiz: Given the code below, will Confident AI run online evaluations on the thread using metrics in "Collection 2" or "Collection 1"?
from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe(metricCollection="Collection 1")
def outer_function():
    update_current_trace(thread_id=your_thread_id, test_case=...)

evaluate_thread(thread_id=your_thread_id, metricCollection="Collection 2")
Answer: This will NOT run "Collection 1" or "Collection 2", because neither the input nor the output has been specified in update_current_trace. This means Confident AI will have no turns to evaluate using metrics in your metric collection.
Note that setting the test_case for a trace has no bearing on the input and output.
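A corrected sketch (the query and response values are placeholders, and it assumes update_current_trace accepts thread_id, input, output, and test_case together, as its separate uses above suggest) would set input and output on the trace so the thread has turns for "Collection 2" to evaluate:
Python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe(metricCollection="Collection 1")
def outer_function(query: str) -> str:
    res = "..."  # your LLM call here
    update_current_trace(
        thread_id=your_thread_id,
        input=query,   # becomes the "user" turn in the thread
        output=res,    # becomes the "assistant" turn in the thread
        test_case=LLMTestCase(input=query, actual_output=res),  # for trace evals
    )
    return res

outer_function("Write me a poem.")
# Now the thread has turns for "Collection 2" to evaluate
evaluate_thread(thread_id=your_thread_id, metricCollection="Collection 2")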