Hyperparameters
Hyperparameters in Confident AI are key-value pairs of type Dict[str, Union[str, Prompt]] that define what parameters are used in your LLM application. For example, a key-value pair of "model": "gpt-4o" allows you to see that this was the model you used for generating the actual_outputs for a particular test run.
Some hyperparameters include:
- Prompts: The version of the prompt template you’re using
- Models: The specific LLM models being used (e.g., gpt-4o, claude-3.5-sonnet, etc.)
- Embedders: The model used to create vector representations for retrieval
- etc.
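For illustration, a hyperparameters dictionary for a test run might look like the sketch below. The key names and values are hypothetical; you can choose whatever names make sense for your application, and prompt values would typically be Prompt objects rather than plain strings:

```python
# Hypothetical hyperparameters for a single test run.
# Keys are arbitrary strings; values are plain strings (or Prompt objects for prompts).
hyperparameters = {
    "model": "gpt-4o",                     # LLM used to generate actual_outputs
    "embedder": "text-embedding-3-small",  # embedding model used for retrieval
    "prompt version": "1.2",               # illustrative stand-in for a Prompt object
}
```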
Associating hyperparameters to test runs is crucial for:
- Comparing different configurations through A/B experimentation
- Reproducing results consistently
- Understanding which settings work best for your use case
You can edit prompts, whether a single text template or a list of messages, in Confident AI’s Prompt Studio.
Types of Hyperparameters
Although you can log as many combinations of hyperparameters as you wish for a test run, there are only two types of hyperparameters on Confident AI:
- Hyperparameters that are prompts
- Hyperparameters that are NOT prompts
When a hyperparameter is a prompt, you can leverage Confident AI’s Prompt Studio to version, edit, and manage them in one centralized place. This allows Confident AI to tell you which prompt version performs best for your use case.
When a hyperparameter is not a prompt, there won’t be any versioning. For example, a model is not a prompt, and you can simply log different model names (e.g. claude-3.5-sonnet, gpt-4, llama-2) to compare their performance across test runs.
The reason we distinguish between prompts and non-prompts is that prompts contain a lot of text and are edited frequently, so they need a better abstraction to get the most out of them.
Prompts
Why prompt versioning?
Prompt versioning is a critical feature for optimizing your LLM applications. By running evaluations with different prompt versions, you can:
- Quantify improvements: Measure how each version performs against your evaluation metrics
- Identify optimal versions: Determine which prompt version produces the best results for your specific use case
- Track evolution: Document how your prompts have evolved and what changes led to improvements
- Compare trade-offs: Understand the balance between different aspects of performance (e.g., accuracy vs. conciseness)
Confident AI automatically tracks which prompt version was used for each test run, allowing you to correlate prompt changes with performance metrics. This data-driven approach to prompt engineering helps you make informed decisions about which prompt version to use in production.
Confident AI provides a Prompt Studio for managing your prompt templates:
- Versioning: Track changes to your prompts over time
- A/B Testing: Compare different prompt versions to see which performs better
- Centralized Storage: Keep all your prompts in one place
- Collaboration: Share prompts with team members
Prompts are either a single text template OR a list of message templates that guide LLM behavior by providing instructions, context, and structure for generating responses. A prompt template contains variables that are interpolated at runtime with dynamic values to create the final prompt sent to the LLM.
For example, a simple prompt template might look like:
You are a helpful AI assistant. Answer the following question based on the provided context:
Context: {context}
Question: {question}
Answer:
When this template is used, the variables {context} and {question} are replaced with actual values to create the final prompt.
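As a minimal illustration using plain Python string formatting (rather than any particular SDK), the interpolation step might look like this:

```python
PROMPT_TEMPLATE = """You are a helpful AI assistant. Answer the following question based on the provided context:

Context: {context}

Question: {question}

Answer:"""

# Replace the template variables with runtime values to build the final prompt.
final_prompt = PROMPT_TEMPLATE.format(
    context="Confident AI lets you log hyperparameters against test runs.",
    question="What are hyperparameters used for?",
)
print(final_prompt)
```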
On the contrary, a list of prompt template messages looks like this:
[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hi my name is {{ name }}. What is the capital of France?"}
]
You can provide this list directly to your LLM provider, such as OpenAI.
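For example, once the {{ name }} variable has been rendered, the messages can be passed straight to OpenAI’s chat completions endpoint. This sketch assumes the official openai Python SDK and an OPENAI_API_KEY environment variable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Render the {{ name }} template variable before sending the messages.
user_template = "Hi my name is {{ name }}. What is the capital of France?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": user_template.replace("{{ name }}", "Alex")},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```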
Prompt Versioning vs Prompts
The difference is subtle, with both terms often used interchangeably, and you shouldn’t worry too much about this.
While a prompt is a text template that guides LLM behavior, prompt versioning refers to the process of tracking and managing different iterations of that prompt over time. Think of it this way:
- A prompt is a single template with variables that gets filled with values at runtime
- A prompt version is a variation of that same prompt that serves the same purpose but is written differently to improve performance
For example, consider a system prompt that defines the personality and capabilities of your AI assistant:
Version 1.0:
You are a helpful AI assistant. Provide concise, accurate answers to user questions.
Version 1.1:
You are a helpful AI assistant with expertise in technical topics. Provide concise, accurate answers to user questions, with occasional technical details when relevant.
Version 1.2:
You are a helpful AI assistant with expertise in technical topics. Provide concise, accurate answers to user questions, with occasional technical details when relevant. Always cite sources when providing information.
Each version represents a distinct iteration of the same prompt, with incremental changes that aim to improve performance.
Models
Models are a non-prompt hyperparameter, and although they are no different from any other non-prompt hyperparameter such as chunk size or top-K, they are worth mentioning since the model is the most common hyperparameter you will probably be logging.
Logging models is crucial for tracking performance differences between different LLM providers and versions, allowing you to make data-driven decisions about which model performs best for your specific use case.
It is important to note that the key of a "model" hyperparameter is entirely up to you to decide. You can set the key to log models as "my llm", and it would still work perfectly fine on Confident AI.
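As a minimal sketch, assuming you run evaluations with deepeval and that your SDK version’s evaluate() accepts a hyperparameters dictionary (check your version’s signature), logging a model under a custom key could look like this:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)

# The key "my llm" is arbitrary; Confident AI treats it like any other
# non-prompt hyperparameter and lets you compare its values across test runs.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"my llm": "gpt-4o"},
)
```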
Others
Beyond prompts and models, there are numerous other hyperparameters you might want to track in your LLM applications.
These include RAG-specific parameters like chunk size, top-K, embedder models, and rerankers; agent-related settings such as available tools and tool selection strategies; and generation parameters like temperature, max tokens, and sampling settings. By logging these alongside your test results, you can systematically identify which configurations produce the best outcomes for your specific use case.
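A hypothetical hyperparameters dictionary for a RAG pipeline with an agent might therefore include entries like the following (all key names here are illustrative, not required):

```python
# Illustrative key names for a RAG + agent setup; use whatever naming you prefer.
hyperparameters = {
    # RAG-specific
    "chunk size": "512",
    "top-K": "5",
    "reranker": "cohere-rerank-v3",
    # Agent-related
    "tool selection strategy": "react",
    # Generation settings
    "temperature": "0.2",
    "max tokens": "1024",
}
```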
Best Practices
When logging hyperparameters, it’s important to distinguish between true hyperparameters and the results of those hyperparameters. Here are some best practices:
- Do not log retrieved documents as hyperparameters: In RAG systems, the documents retrieved are a result of your hyperparameters (like embedding model, chunk size, etc.), not hyperparameters themselves. Logging these as hyperparameters will pollute your testing data and make it difficult to draw meaningful conclusions.
- Focus on configurable parameters: Only log parameters that you can intentionally change to improve your system’s performance.
- Be consistent with naming: Use consistent keys for the same hyperparameters across different test runs to ensure proper comparison.
- Document your hyperparameters: Keep a record of what each hyperparameter controls and its expected impact on performance.
- Avoid logging sensitive information: Never include API keys, passwords, or other sensitive data as hyperparameters.
Following these practices will help you maintain clean, meaningful test data that accurately reflects the impact of your hyperparameter choices on your LLM application’s performance.
What About Complex Agentic Systems?
In complex agentic systems with multiple components, it’s essential to use distinct hyperparameter keys for each component. For example, if you have a “supervisor” agent that delegates tasks to specialized agents, each with its own embedding model for retrieval, you should use different keys like "supervisor_embedder" and "specialist_embedder" rather than a generic "embedder" key.
This approach:
- Prevents confusion when analyzing results
- Allows you to track performance of individual components
- Makes it easier to identify which specific component’s hyperparameters need adjustment
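For instance, a hyperparameters dictionary for such a multi-agent system might use component-prefixed keys like the following (the key names and model choices are illustrative):

```python
# Component-prefixed keys keep each agent's configuration distinguishable
# when comparing results across test runs.
hyperparameters = {
    "supervisor_model": "gpt-4o",
    "supervisor_embedder": "text-embedding-3-large",
    "specialist_model": "claude-3.5-sonnet",
    "specialist_embedder": "text-embedding-3-small",
}
```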
These “components” can be conceptualized as spans within a trace of your application’s execution. A trace represents the complete call hierarchy and structure of your LLM application for a single request, while spans are the individual operations (like LLM calls, retrieval steps, or tool usage) that make up that trace.
LLM tracing on Confident AI allows you to monitor performance, track hyperparameter impacts across different components, and identify optimization opportunities in complex systems.