OpenAI
This tutorial will show you how to trace your OpenAI API calls on Confident AI Observatory.
Quickstart
Install the following packages:
pip install -U deepeval openai
Log in to Confident AI in the CLI using your API key:
deepeval login --confident-api-key YOUR_API_KEY
To begin tracing your OpenAI API calls, import OpenAI from deepeval.openai instead of openai.
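In practice the only change to an existing application is the import itself; a minimal sketch of the swap (the client constructor and method calls stay the same, since DeepEval's client is a drop-in replacement):

# Before: the official client
# from openai import OpenAI

# After: DeepEval's drop-in replacement, which traces each call
from deepeval.openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment as usual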
Chat Completions
from deepeval.openai import OpenAI
from deepeval.tracing import observe

client = OpenAI()

@observe(type="llm")
def generate_response(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content

response = generate_response("What is the weather in Tokyo?")
The above code will automatically capture the following information and send it to Observatory (no need to set the values of LlmAttributes):

model: Name of the OpenAI model
input: Input messages
output: Output messages
tool_calls: List of tool calls made by the OpenAI model
input_token_count: Input token count
output_token_count: Output token count
Read more about LLM spans and their attributes here.
We use monkey patching under the hood, which dynamically wraps the chat.completions.create, beta.chat.completions.parse, and responses.create methods of the OpenAI client at runtime while preserving the original method signatures.
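For intuition, here is a rough, hypothetical sketch of what such a wrapper looks like in plain Python (not DeepEval's actual code): the original bound method is saved, replaced with a wrapper that delegates to it, and functools.wraps keeps the original metadata intact.

import functools
from openai import OpenAI

def patch_chat_completions(client: OpenAI) -> OpenAI:
    completions = client.chat.completions
    original_create = completions.create  # keep a reference to the real method

    @functools.wraps(original_create)  # preserve name, docstring, and signature metadata
    def traced_create(*args, **kwargs):
        response = original_create(*args, **kwargs)
        # A real tracer would build an LLM span here (model, input/output
        # messages, token counts); this sketch just prints the usage.
        print(response.model, response.usage)
        return response

    completions.create = traced_create  # swap in the wrapper at runtime
    return client

# Usage: client = patch_chat_completions(OpenAI())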
Using OpenAI in Component-level Evaluations
You can also use DeepEval’s OpenAI client to run component-level evaluations locally. To do this, replace your existing OpenAI client, pass in metrics, and invoke your LLM application under the dataset generator.
What metrics are supported?
DeepEval’s OpenAI client populates an LLMTestCase for each call to chat.completions.create and responses.create with the input, output, and tools_called parameters, so any metrics that only require these parameters will work out of the box.
If you need to run metrics that require other test case parameters, such as the expected_output, you can pass them as arguments to DeepEval’s OpenAI methods.
response = client.chat.completions.create(
    ...,
    expected_output="...",
    retrieval_context=["...", "..."]
)
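The same idea applies to the Responses API method; a hedged sketch, assuming responses.create accepts the identical extra test case parameters:

response = client.responses.create(
    model="gpt-4.1",
    instructions="You are a helpful assistant.",
    input="...",
    expected_output="...",          # extra test case parameter (assumed pass-through)
    retrieval_context=["...", "..."]
)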
Chat Completions
from deepeval.openai import OpenAI
from deepeval.tracing import observe
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import Golden
from deepeval.evaluate import dataset

client = OpenAI()

@observe(type="llm")
def generate_response(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input},
        ],
        metrics=[AnswerRelevancyMetric()],
    )
    return response.choices[0].message.content

# Create goldens
goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]

# Run component-level evaluation
for golden in dataset(goldens=goldens):
    generate_response(golden.input)
End-to-End Evaluations
To run end-to-end evaluations directly on your OpenAI client, use the dataset generator to call OpenAI’s chat.completions.create or responses.create for each golden.
Chat Completions
from deepeval.openai import OpenAI
from deepeval.evaluate import dataset
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric

client = OpenAI()

for golden in dataset(alias="Your Dataset Name"):
    # Run the OpenAI client on each golden
    client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": golden.input},
        ],
        metrics=[AnswerRelevancyMetric(), BiasMetric()],
    )
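Since responses.create is wrapped as well, the same end-to-end loop works with the Responses API; a sketch assuming the metrics argument behaves identically there:

for golden in dataset(alias="Your Dataset Name"):
    client.responses.create(
        model="gpt-4o",
        instructions="You are a helpful assistant.",
        input=golden.input,
        metrics=[AnswerRelevancyMetric(), BiasMetric()],
    )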