Latency and Cost Tracking
Confident AI lets you track the latency and cost of your LLM calls, which helps you identify inefficiencies in your LLM systems, such as expensive models or unusually heavy usage. There are two types of cost tracking:
- Manual cost tracking: define the token count and per-token costs manually in code
- Automatic cost tracking: Confident AI infers the token count and per-token costs based on the model
The `@observe` decorator automatically tracks span latency, so this guide focuses mainly on how to set up cost tracking.
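As a quick illustration, a function only needs the `@observe` decorator for its span latency to be recorded; this is a minimal sketch, and the model name and return value are just placeholders:

    from deepeval.tracing import observe

    @observe(type="llm", model="gpt-4o")
    def generate_response(prompt: str) -> str:
        # Latency for this span is measured automatically by @observe;
        # no additional instrumentation is needed.
        return "Generated response"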
LLM Cost and Latency Tracking
Code Summary
Manual Cost Tracking
    from deepeval.tracing import observe, update_current_span, LlmAttributes

    @observe(
        type="llm",
        model="gpt-4o",
        cost_per_input_token=0.001,
        cost_per_output_token=0.001
    )
    def generate_response(prompt: str) -> str:
        output = "Generated response"
        # Report token counts so Confident AI can compute cost from the
        # per-token prices set on the decorator
        update_current_span(
            attributes=LlmAttributes(
                input_token_count=10,
                output_token_count=20,
            )
        )
        return output
Set Up Cost Tracking
You can either manually configure cost tracking or let Confident AI calculate your costs automatically from the inputs and outputs of your LLM span based on the provided `model`.
Automatic cost tracking is only available for OpenAI, Anthropic, and Gemini models.
If token-usage and cost data are provided in code, Confident AI computes the cost directly from those values. If not, it attempts to infer the cost from the model, input, and output by following these steps:
- Verify that the model, input, and output are all available and valid.
- Select the appropriate tokenizer for the model provider.
- Count the input and output tokens.
- Retrieve the per-token pricing from the provider.
- Compute the total cost.
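The snippet below is a rough sketch of that inference flow, not Confident AI's actual implementation; the `infer_cost` helper and the pricing table (and its values) are hypothetical, and tiktoken is used because it is the tokenizer listed for OpenAI models in the table further down (assuming a recent tiktoken release that knows about gpt-4o):

    # Rough sketch of the inference steps above; not Confident AI's actual code.
    # The pricing values below are placeholders, not real provider prices.
    import tiktoken

    HYPOTHETICAL_PRICING = {
        "gpt-4o": {"input": 0.0000025, "output": 0.00001},  # USD per token (made up)
    }

    def infer_cost(model: str, input_text: str, output_text: str) -> float | None:
        # Step 1: verify that the model, input, and output are available and valid
        if not (model and input_text and output_text):
            return None
        # Step 2: select the tokenizer for the model provider (tiktoken for OpenAI)
        encoding = tiktoken.encoding_for_model(model)
        # Step 3: count the input and output tokens
        input_tokens = len(encoding.encode(input_text))
        output_tokens = len(encoding.encode(output_text))
        # Step 4: look up the per-token pricing for the model
        pricing = HYPOTHETICAL_PRICING.get(model)
        if pricing is None:
            return None
        # Step 5: compute the total cost
        return input_tokens * pricing["input"] + output_tokens * pricing["output"]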
Automatic Cost Tracking
To set up automatic cost tracking, provide the `model` in the `@observe` decorator of your LLM span, and provide the `input` and `output` in your `LlmAttributes`.
    from deepeval.tracing import observe, update_current_span, LlmAttributes

    @observe(
        type="llm",
        model="gpt-4o"
    )
    def generate_response(prompt: str) -> str:
        output = "Generated response"
        # Pass the raw input and output so Confident AI can tokenize them
        # and infer the cost for the given model
        update_current_span(
            attributes=LlmAttributes(
                input=prompt,
                output=output,
            )
        )
        return output
The table below summarizes each available model provider and its corresponding tokenization method.
Provider | Tokenizer | Example Models | Token Counting Method |
---|---|---|---|
OpenAI | tiktoken | GPT-4, GPT-3.5, O1, O3 | Client-side tokenization using model-specific encodings |
Anthropic | @anthropic-ai/tokenizer | Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku | Claude-specific tokenization algorithm |
Gemini | Gemini API | Gemini 1.5 Pro, Gemini 2.0 Flash | Server-side token counting via API call |
See the OpenAI documentation, Anthropic documentation, or Google documentation for the most up-to-date pricing and token-counting information.
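If you want to sanity-check the server-side counting used for Gemini models, something like the following works with Google's google-generativeai SDK; it is shown only for illustration, is not part of deepeval, and the model name and prompt are just examples:

    import google.generativeai as genai

    # Assumes a valid Google API key is available
    genai.configure(api_key="YOUR_API_KEY")

    model = genai.GenerativeModel("gemini-1.5-pro")
    # count_tokens sends the text to the API and returns the server-side count
    response = model.count_tokens("Calculate the cost of this call")
    print(response.total_tokens)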
Manual Cost Tracking
To manually set up cost tracking, provide the `cost_per_input_token` and `cost_per_output_token` in the `@observe` decorator of your LLM span, and pass the input and output token counts via `LlmAttributes`.
Manual cost tracking is the recommended approach if you need exact cost figures, or if your model provider does not support automatic cost tracking.
    from deepeval.tracing import observe, update_current_span, LlmAttributes

    @observe(
        type="llm",
        model="gpt-4o",
        cost_per_input_token=0.001,
        cost_per_output_token=0.001
    )
    def generate_response(prompt: str) -> str:
        output = "Generated response"
        # Report the token counts yourself; Confident AI multiplies them by
        # the per-token prices set on the decorator
        update_current_span(
            attributes=LlmAttributes(
                input_token_count=10,
                output_token_count=20,
            )
        )
        return output

    generate_response("Calculate the cost of this call")
The total cost of this call will be computed as:
(input_token_count × cost_per_input_token) + (output_token_count × cost_per_output_token)
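With the values in the example above, that works out to (10 × 0.001) + (20 × 0.001) = 0.01 + 0.02 = 0.03.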