
OpenAI

This tutorial will show you how to trace your OpenAI API calls on Confident AI Observatory.

Quickstart

Install the following packages:

pip install -U deepeval openai

Log in to Confident AI with your API key in the CLI:

deepeval login --confident-api-key YOUR_API_KEY

To begin tracing your OpenAI API calls, import OpenAI from deepeval.openai instead of openai.

from deepeval.openai import OpenAI
from deepeval.tracing import observe

client = OpenAI()

@observe(type="llm")
def generate_response(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content

response = generate_response("What is the weather in Tokyo?")

The above code automatically captures the following information and sends it to Observatory (no need to set the values of LlmAttributes yourself):

  • model: Name of the OpenAI model
  • input: Input messages
  • output: Output messages
  • tool_calls: List of tool calls by the OpenAI model
  • input_token_count: Input token count
  • output_token_count: Output token count
💡 Read more about LLM spans and their attributes here.
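
For context, without the drop-in client you would have to set these values on the LLM span yourself. Below is a minimal sketch of what that manual step looks like, assuming deepeval's update_current_span helper and LlmAttributes class (import paths and fields may differ between deepeval versions):

# Illustrative only: manually populating LLM span attributes.
# Assumes update_current_span and LlmAttributes are available in your
# deepeval version; the traced OpenAI client makes this unnecessary.
from deepeval.tracing import observe, update_current_span
from deepeval.tracing.attributes import LlmAttributes

@observe(type="llm")
def generate_response_manual(input: str) -> str:
    output = "..."  # call your LLM here
    update_current_span(
        attributes=LlmAttributes(
            input=input,
            output=output,
            input_token_count=10,   # hypothetical token counts
            output_token_count=25,
        )
    )
    return output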

Under the hood, we use monkey patching to dynamically wrap the chat.completions.create, beta.chat.completions.parse, and responses.create methods of the OpenAI client at runtime while preserving the original method signatures.
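
As a rough illustration of that technique (not deepeval's actual implementation), a wrapper function replaces the client method at runtime, records the call, and delegates to the original; functools.wraps keeps the wrapped method's metadata intact:

import functools
from openai import OpenAI

client = OpenAI()
captured_calls = []

def wrap_create(original_create):
    @functools.wraps(original_create)  # preserve the original signature metadata
    def wrapper(*args, **kwargs):
        response = original_create(*args, **kwargs)
        # Record the same kind of data the tracer captures automatically
        captured_calls.append({
            "model": kwargs.get("model"),
            "input": kwargs.get("messages"),
            "output": response.choices[0].message.content,
            "input_token_count": response.usage.prompt_tokens,
            "output_token_count": response.usage.completion_tokens,
        })
        return response
    return wrapper

# Monkey patch the method on the client instance at runtime
client.chat.completions.create = wrap_create(client.chat.completions.create)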

Using OpenAI in Component-level Evaluations

You can also use DeepEval’s OpenAI client to run component-level evaluations locally. To do this, replace your existing OpenAI client, pass in metrics, and invoke your LLM application inside the dataset generator loop.

What metrics are supported?

DeepEval’s OpenAI client populates an LLMTestCase for each call to chat.completions.create and responses.create with the input, output, and tools_called parameters, so any metric that only requires these parameters works out of the box.
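
Conceptually, each traced call yields a test case roughly like the one below; the client assembles this for you, and the values shown are hypothetical:

from deepeval.test_case import LLMTestCase, ToolCall

# What the client populates per call (illustrative values)
test_case = LLMTestCase(
    input="What is the weather in Tokyo?",
    actual_output="It is currently 18°C and sunny in Tokyo.",
    tools_called=[ToolCall(name="get_weather")],
)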

If you need to run metrics that require other test case parameters such as the expected_output, you can pass them as arguments to DeepEval’s OpenAI methods.

response = client.chat.completions.create(
    ...,
    expected_output="...",
    retrieval_context=["...", "..."]
)
from deepeval.openai import OpenAI
from deepeval.tracing import observe
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import Golden
from deepeval.evaluate import dataset

client = OpenAI()

@observe(type="llm")
def generate_response(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input},
        ],
        metrics=[AnswerRelevancyMetric()],
    )
    return response.choices[0].message.content

# Create goldens
goldens = [
    Golden(input="What is the weather in Bogotá, Colombia?"),
    Golden(input="What is the weather in Paris, France?"),
]

# Run component-level evaluation
for golden in dataset(goldens=goldens):
    generate_response(golden.input)

End-to-End Evaluations

To run end-to-end evaluations directly on your OpenAI client, use the dataset generator to call OpenAI’s chat.completions.create or responses.create for each golden.

from deepeval.openai import OpenAI
from deepeval.evaluate import dataset
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric

client = OpenAI()

for golden in dataset(alias="Your Dataset Name"):
    # Run the OpenAI client on each golden in the dataset
    client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": golden.input}
        ],
        metrics=[AnswerRelevancyMetric(), BiasMetric()]
    )