Agent Evaluation
Warning
The codebase for evaluation is under development and is not yet stable. Use with caution; we welcome contributions.
Evaluation using any_agent.evaluation takes a "trace-first" approach. Evaluating a trace does not produce a pass/fail verdict; instead, it produces a score based on how well user-defined criteria are achieved for each example. Agent systems are hyper-specific to each use case, and it is difficult to provide a single set of metrics that would reliably give the insight needed to judge an agent's effectiveness.
Using any-agent evaluation, you can specify any criteria you wish, and through the LLM-as-a-judge technique, any-agent will evaluate which criteria are satisfied.
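To make the scoring concrete, here is a minimal sketch of how a weighted, criteria-based score can be computed from judge verdicts. The data shapes and the final normalization are illustrative assumptions, not the library's internal implementation.
from dataclasses import dataclass

@dataclass
class CheckpointResult:
    criteria: str
    points: int
    passed: bool  # verdict produced by the LLM judge for this criterion

def weighted_score(results: list[CheckpointResult]) -> float:
    # Score = points earned on satisfied criteria / total points available
    total = sum(r.points for r in results)
    earned = sum(r.points for r in results if r.passed)
    return earned / total if total else 0.0

# Two of three criteria satisfied, each weighted by its points
results = [
    CheckpointResult("agent called the expected tool", 1, True),
    CheckpointResult("agent cited its sources", 1, True),
    CheckpointResult("agent produced a numeric final answer", 2, False),
]
print(weighted_score(results))  # 0.5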
Example
Using the unified tracing format provided by any-agent's tracing functionality, a trace can be evaluated against user-defined criteria. The steps for evaluating an agent are as follows:
Run an agent using any-agent, which will produce a trace. For example:
from any_agent import AgentConfig, AnyAgent, TracingConfig
from any_agent.tools import search_web
# Create an agent (here using the LangChain framework) with web search and console tracing
agent = AnyAgent.create(
    "langchain",
    AgentConfig(
        model_id="gpt-4o-mini",
        tools=[search_web],
    ),
    tracing=TracingConfig(console=True, cost_info=False),
)
agent_trace = agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")
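If you want to evaluate the trace later, or pass it to the command-line tool described below, you can persist it to disk first. A minimal sketch, assuming the returned trace object is a pydantic model (the filename is arbitrary):
# Save the trace so it can be re-used by later evaluation runs or the CLI
with open("agent_trace.json", "w", encoding="utf-8") as f:
    f.write(agent_trace.model_dump_json(indent=2))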
Define an evaluation case, either in a YAML file or in Python:
# The criteria will be passed to an llm-as-a-judge along with the trace as context
# The points specify the weight given to each criterion when producing the final score
llm_judge: openai/gpt-4o
checkpoints:
  - points: 1
    criteria: Ensure that the agent called the search_web tool in order to retrieve the length of Pont des Arts
  - points: 1
    criteria: Ensure that the agent called the search_web tool in order to access the top speed of a leopard
  - points: 1
    criteria: |
      Ensure that the agent ran a python snippet to combine the information
      retrieved from the web searches
# Optionally, check whether the final answer matches the expected value. This check does not use an LLM
ground_truth:
  - name: Time
    points: 5
    value: 9.63
from any_agent.evaluation.evaluation_case import EvaluationCase

evaluation_case = EvaluationCase(
    ground_truth=[{"name": "Seconds", "value": "9", "points": 1.0}],
    checkpoints=[
        {"criteria": "Did the agent run a calculation", "points": 1},
        {"criteria": "Did the agent use fewer than 5 steps", "points": 4},
    ],
    llm_judge="gpt-4o-mini",
)
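If you wrote the case as a YAML file instead, one way to turn it into an EvaluationCase is to parse the file with PyYAML and pass the fields to the constructor. This is a sketch under that assumption; the filename is hypothetical, and the library may also offer its own loader:
import yaml
from any_agent.evaluation.evaluation_case import EvaluationCase

with open("evaluation_case.yaml", "r", encoding="utf-8") as f:
    case_data = yaml.safe_load(f)

# EvaluationCase validates the llm_judge, checkpoints and ground_truth fields
evaluation_case = EvaluationCase(**case_data)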
Run the evaluation using the evaluation case and the trace:
from any_agent.evaluation import evaluate

eval_result = evaluate(
    evaluation_case=evaluation_case,
    trace=agent_trace,
    agent_framework="langchain",  # should match the framework used to create the agent
)
print(f"Final score: {eval_result.score}")
Command Line
If you have the trace file and the evaluation case prepared, a command-line tool called any-agent-evaluate is provided for convenience. It can be called like so:
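The flag names in this invocation are assumptions for illustration only; consult the tool's help output for its actual interface.
any-agent-evaluate \
    --evaluation_case_path evaluation_case.yaml \
    --trace_path agent_trace.json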