Demo & Eval Methodology for Agent Harness Toolkits

TL;DR: Harness engineering shifts benchmark scores by 20–30+ points without changing the model [1], making the harness layer the decisive variable in any toolkit comparison. Structure a session around three task types (precise tool calls, multi-step planning, error recovery), evaluate at three levels (end-to-end / trajectory / component) [2], and distinguish pass@k (does a solution exist?) from pass^k (is the solution reliable?) [3]. Source your first 20–50 test tasks from real failures, not invented scenarios.

Benchmark Landscape

Six public benchmarks carry signal for harness comparisons in 2026 [1] [17]:

Benchmark	Measures	2026 Leader	Key caveat
SWE-bench Verified	Code-gen agents on real GitHub issues	Claude Opus 4.7 (87.6%)	[1] Contamination confirmed — prefer BenchLM/Epoch AI scores
GAIA	Multi-step reasoning + tool use (466 questions)	Claude Sonnet 4.5 (74.6%)	[1] Same model: 74.6% w/ scaffold vs 44.8% bare (+29.8 pp)
TAU-bench	Enterprise policy adherence, tool-agent-user	—	[1] Penalises correct-but-violating answers; best prod proxy
WebArena	Browser navigation across realistic sites	Claude Mythos Preview (68.7%)	[1] Human baseline ~78%; meaningful gap remains
Terminal-Bench 2.0	89 terminal tasks: file ops, sysadmin, debug	Harness-only: rank 30 → top 5	[5] Best isolator of harness quality vs. model capability
AgentBench	8 diverse environments, aggregated score	—	[1] 70% aggregate can hide 0% on 2 sub-environments
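
The AgentBench caveat is easy to check with arithmetic. A minimal sketch, assuming eight equally weighted environments; the names env_1..env_8 and all scores are made up for illustration and are not AgentBench data:

```python
from statistics import mean

# Hypothetical per-environment pass rates (illustrative, not real results).
scores = {
    "env_1": 0.94, "env_2": 0.93, "env_3": 0.93, "env_4": 0.94,
    "env_5": 0.93, "env_6": 0.93, "env_7": 0.00, "env_8": 0.00,
}

def summarize(per_env: dict[str, float], floor: float = 0.5) -> dict:
    """Return the unweighted aggregate plus every environment below `floor`."""
    return {
        "aggregate": round(mean(per_env.values()), 3),
        "below_floor": sorted(k for k, v in per_env.items() if v < floor),
    }

report = summarize(scores)
print(report)
```

Here the aggregate comes out at 0.7 even though two environments score zero, which is why the per-environment list belongs next to any headline number.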

⚠ Berkeley researchers reward-hacked all major benchmarks by April 2026 [16]. Prefer third-party (Epoch AI / BenchLM) evaluation scores. Run a held-out task suite built from your actual workload — leaderboard positions alone are unreliable for toolkit selection.

Evaluation Levels

All three levels must be scored independently [2] [4]. Scoring only end-to-end hides which layer regressed when a test breaks.

Level	What it scores	Grader type
End-to-end	Did the agent accomplish the user’s goal?	[4] Code / human
Trajectory	Did it plan well, select correct tools, recover from errors?	[4] LLM-as-judge
Component	Does each retriever, sub-agent, or tool call perform in isolation?	[4] Deterministic

A unified evaluation framework [14] that runs tasks as instruction–tool–environment triplets under sandboxed conditions separates environmental failures (network timeouts, anti-crawling) from genuine agent reasoning failures: up to 37.7% of apparent failures in online evaluation stem from the environment, not the agent. Use offline sandbox snapshots for reproducible toolkit comparisons.
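
One way to keep environment noise out of the agent's score is to tag each failed run by cause before computing any rate. A minimal sketch, assuming a hypothetical trace schema with an `error` field; the error labels are illustrative and not taken from [14]:

```python
from collections import Counter

# Assumed labels for environment-caused errors; adapt to your own trace schema.
ENVIRONMENT_ERRORS = {"network_timeout", "anti_crawling_block", "dns_failure"}

def split_failures(failed_runs: list[dict]) -> Counter:
    """Count failed runs by cause: 'environment' vs 'agent'."""
    causes = Counter()
    for run in failed_runs:
        cause = "environment" if run["error"] in ENVIRONMENT_ERRORS else "agent"
        causes[cause] += 1
    return causes

failed = [
    {"task": "t1", "error": "network_timeout"},
    {"task": "t2", "error": "wrong_tool"},
    {"task": "t3", "error": "anti_crawling_block"},
    {"task": "t4", "error": "bad_argument"},
]
counts = split_failures(failed)
env_share = counts["environment"] / sum(counts.values())
print(counts, env_share)
```

Runs tagged as environment failures can be re-run against a sandbox snapshot instead of being counted against the harness.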

Core Metrics

Twelve primary metrics across six layers, drawn from Anthropic [3] and DeepEval [2]:

Layer	Metric	Type	Production target
Tool use	Tool correctness	Deterministic	[6] > 95%
Tool use	Argument correctness	Deterministic	[6] > 95%
Tool use	Step efficiency	Deterministic	[6] Minimal viable
Planning	Plan quality	LLM judge	[2] Task-specific
Planning	Plan adherence	LLM judge	[2] Task-specific
Output	Task completion rate	Code / human	[6] > 90%
Output	Loop rate	Deterministic	[6] 0%
Safety	Prompt injection	Adversarial	[6] 0 failures
Safety	Policy adherence	LLM judge	[6] Task-specific
Retrieval	Answer relevancy	LLM judge	[2] Task-specific
Retrieval	Faithfulness	LLM judge	[2] Task-specific
Stats	pass@k / pass^k	Statistical	[3] Depends on use

pass@k is the probability of success in at least one of k runs and answers “does a solution exist?” pass^k is the probability of success in all k runs and answers “is it reliably repeatable?” They diverge sharply above k=3 [3]. Use pass^k for production deployment gates and pass@k for initial capability discovery.
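
The divergence is visible in a few lines of arithmetic. A minimal sketch, assuming independent trials with a fixed per-trial success rate p (a simplification; real trials may be correlated). The function names and the value p = 0.7 are illustrative:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent trials succeeds) = 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """P(all k independent trials succeed) = p^k."""
    return p ** k

p = 0.7  # assumed per-trial success rate
table = {
    k: (round(pass_at_k(p, k), 3), round(pass_hat_k(p, k), 3))
    for k in (1, 3, 5)
}
for k, (at_k, hat_k) in table.items():
    print(f"k={k}: pass@k={at_k}  pass^k={hat_k}")
```

With p = 0.7, pass@5 rounds to 0.998 while pass^5 falls to 0.168: the same agent looks near-perfect under one metric and unreliable under the other.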

Evaluation Tooling

Six frameworks cover the eval stack for agent harnesses [7]:

Tool	Best for	Trajectory-first?	License
LangSmith	LangChain/LangGraph runtimes	✓	Proprietary
Future AGI	Mixed runtimes, OSS trajectory eval	✓	Apache 2.0
Braintrust	Cross-functional teams, UI-first	✗ (bolted-on)	SaaS
DeepEval ⭐ 16k	CI-native pytest, 50+ metric library	✗	Apache 2.0
Arize Phoenix ⭐ 10k	OpenTelemetry / open standards	Partial	ELv2
OpenAI Evals ⭐ 19k	OpenAI-only stacks, minimal setup	✗	MIT

Selection heuristic [7]: LangGraph shop → LangSmith (native trajectory semantics). Mixed runtime + OSS required → Future AGI. CI as system of record → DeepEval (pytest ergonomics, 50+ metrics). Open standards required → Arize Phoenix (OTel + OpenInference). ⚠ LLM-as-judge at ~30¢/trace × 3 judges per step compounds fast, so gate on deterministic checks first.
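
The saving from deterministic gating can be estimated before any judge is called. A rough sketch, assuming the ~30¢ figure is the price of one judge call (an interpretation, not something the source states) and a hypothetical step record with `tool_ok` and `args_ok` flags:

```python
def judge_cost(steps: list[dict], judges: int = 3,
               cost_per_call: float = 0.30, gated: bool = True) -> float:
    """Estimated LLM-judge spend in dollars for one trace.

    With gating, steps that already failed a deterministic check are
    scored as failures and never sent to a judge.
    """
    if gated:
        steps = [s for s in steps if s["tool_ok"] and s["args_ok"]]
    return round(len(steps) * judges * cost_per_call, 2)

# Ten illustrative steps; four fail a deterministic check.
steps = (
    [{"tool_ok": True, "args_ok": True}] * 6
    + [{"tool_ok": True, "args_ok": False}] * 3
    + [{"tool_ok": False, "args_ok": False}]
)
ungated = judge_cost(steps, gated=False)
gated = judge_cost(steps, gated=True)
print(ungated, gated)
```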

Structuring a Live Comparison Session

Without explicit methodology, a framework showdown produces anecdotal results [13]:

1 — Isolate the harness layer. Fix the model (same provider, same version, same system prompt) across all candidates. The session measures harness quality, not model capability.

2 — Select three task categories covering core harness responsibilities [9]:

Tool-call precision — deterministically gradable (right tool, right arguments, first try)
Multi-step planning — 5+ steps carrying context across turns
Error recovery — chaos-inject a tool failure (500 error, malformed JSON) [6]
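
An error-recovery task can be made repeatable by injecting the failure at a fixed call number rather than at random. A minimal sketch; `ToolHTTPError`, `chaos_wrap`, `get_weather`, and the retry policy are all hypothetical stand-ins, not part of any toolkit's API:

```python
import json

class ToolHTTPError(Exception):
    """Hypothetical error standing in for an HTTP 500 from a tool."""

def chaos_wrap(tool, fail_on_calls: set[int], mode: str = "http_500"):
    """Wrap `tool` so the listed call numbers (1-based) fail on purpose."""
    calls = {"n": 0}

    def wrapped(*args, **kwargs):
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            if mode == "http_500":
                raise ToolHTTPError("500 Internal Server Error (injected)")
            return '{"result": '  # truncated, malformed JSON
        return tool(*args, **kwargs)

    return wrapped

def get_weather(city: str) -> str:
    """Stand-in tool that returns JSON text."""
    return json.dumps({"city": city, "temp_c": 21})

def call_with_retry(tool, *args, retries: int = 1):
    """Toy recovery policy: retry on an injected failure, report attempts used."""
    for attempt in range(retries + 1):
        try:
            return json.loads(tool(*args)), attempt
        except (ToolHTTPError, json.JSONDecodeError):
            if attempt == retries:
                raise

flaky = chaos_wrap(get_weather, fail_on_calls={1})
result, retries_used = call_with_retry(flaky, "Oslo")
print(result, retries_used)
```

Because the failure is pinned to a specific call, every candidate harness faces the same fault at the same point, which keeps the recovery score comparable.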

3 — Declare the demo mode upfront. Audiences tolerate pre-flight; they do not tolerate ambiguity [8]:

Mode	Description	Credibility	Risk
Verified Live	Real-time execution, unscripted	Highest	High
Constrained Live	Real-time within controlled parameters	High	Medium
Pre-flight Live	Agent ran beforehand; replayed with acknowledgment	Medium	Low
Pre-Recorded	Polished video, clearly labelled as such	Lower	Lowest
Aspirational	Explicitly framed as future vision	None (current)	None

The 2023 Gemini video controversy is the canonical example of what happens when mode is unlabelled [8]. Label the mode at the start; production discipline is itself a credibility signal.

4 — Score at all three levels. Capture full execution traces. Use deterministic graders for tool selection and LLM-as-judge for plan quality and recovery behaviour. Record token cost and wall-clock latency: cost routinely varies 5–6× across frameworks [1].
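
A deterministic grader for tool selection can be as simple as a position-by-position comparison against a reference trace. A minimal sketch, assuming a hypothetical trace format where each call is a dict with `tool` and `args`; the tool names are illustrative:

```python
def grade_tool_calls(expected: list[dict], actual: list[dict]) -> dict:
    """Score tool and argument correctness position by position."""
    n = len(expected)
    tool_hits = arg_hits = 0
    for exp, act in zip(expected, actual):
        if exp["tool"] == act["tool"]:
            tool_hits += 1
            if exp["args"] == act["args"]:
                arg_hits += 1
    return {
        "tool_correctness": tool_hits / n if n else 1.0,
        "argument_correctness": arg_hits / n if n else 1.0,
        "extra_steps": max(len(actual) - n, 0),  # crude step-efficiency signal
    }

expected = [
    {"tool": "search_orders", "args": {"customer_id": 42}},
    {"tool": "refund", "args": {"order_id": "A1", "amount": 10}},
]
actual = [
    {"tool": "search_orders", "args": {"customer_id": 42}},
    {"tool": "refund", "args": {"order_id": "A1", "amount": 100}},  # wrong amount
    {"tool": "search_orders", "args": {"customer_id": 42}},         # redundant call
]
grades = grade_tool_calls(expected, actual)
print(grades)
```

Strict positional matching is a deliberate simplification; tasks with several valid tool orders would need a more forgiving comparison.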

5 — Report per-dimension, not aggregate. A single composite score hides failure modes [7]. Present the breakdown table; let each dimension speak independently.
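
The case against a composite is easy to demonstrate with two candidates that tie on the average. A minimal sketch with invented scores; `harness_a`, `harness_b`, and the dimension names are placeholders:

```python
from statistics import mean

# Illustrative scores only; not measurements of any real toolkit.
results = {
    "harness_a": {"tool_precision": 0.98, "planning": 0.85, "error_recovery": 0.40},
    "harness_b": {"tool_precision": 0.72, "planning": 0.76, "error_recovery": 0.75},
}

def composite(scores: dict[str, float]) -> float:
    """The single number this step advises against reporting on its own."""
    return round(mean(scores.values()), 2)

def weakest(scores: dict[str, float]) -> str:
    """Name of the lowest-scoring dimension."""
    return min(scores, key=scores.get)

def breakdown_rows(results: dict[str, dict[str, float]]) -> list[str]:
    """Tab-separated breakdown table, one row per candidate."""
    dims = sorted({d for s in results.values() for d in s})
    rows = ["\t".join(["candidate", *dims])]
    for name in sorted(results):
        rows.append("\t".join([name, *(f"{results[name][d]:.2f}" for d in dims)]))
    return rows

rows = breakdown_rows(results)
print("\n".join(rows))
```

Both candidates average 0.74, yet one of them recovers from injected errors only 40% of the time; only the breakdown table shows that.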

Skill-Pack Specific Evaluation

When the comparison target is the skill/workflow/discipline-pack layer rather than the full orchestration framework:

Utility vs. security tradeoff. SkillTester [10] runs an 86-task benchmark across 11 domains and finds that higher utility consistently correlates with broader permissions and greater security exposure. Score utility and security as independent axes — collapsing them into a single metric obscures the tradeoff.

Task-adaptive rubrics. Fixed rubrics apply generic criteria across all tasks, missing domain-specific requirements. AdaRubric [11] generates evaluation dimensions from the task itself, improving inter-annotator agreement and producing better reward model training signals — especially important for diverse skill packs where a single rubric fails to capture domain requirements.

Skill retrieval at scale. As skill libraries grow, retrieval quality determines task success. SkillFlow [12] shows embedding-based retrieval substantially outperforms keyword matching for complex queries. Track Recall@K and MRR as library size scales.
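
Both retrieval metrics are short enough to compute without a framework. A minimal sketch using the standard definitions of Recall@K and mean reciprocal rank; the skill IDs and queries are invented for illustration:

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant skills that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

# Each entry: (ranked skill IDs returned by the retriever, truly relevant IDs).
queries = [
    (["s3", "s1", "s7"], {"s1"}),
    (["s2", "s9", "s4"], {"s2", "s4"}),
    (["s5", "s6", "s8"], {"s0"}),
]
recalls = [recall_at_k(r, rel, k=2) for r, rel in queries]
mrr = mean_reciprocal_rank(queries)
print(recalls, mrr)
```

Re-running the same query set at each library size gives the scaling curve the paragraph asks for.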

Six Framework Comparison Dimensions

For comparing full harness frameworks, these six dimensions are the primary scoring axes [13]:

Architecture and core abstractions — state machines vs. reactive graphs vs. role-based crews
State management — persistence, durability, human-in-the-loop checkpoints
Tool integration and MCP support — protocol compatibility, scaling tool inventories
Multi-agent orchestration — supervisor, peer, and pipeline coordination patterns
Observability and debugging — trace replay, visual state inspection
Production readiness — enterprise deployments, community maturity, licensing

No framework dominates all six. The primary trade-off axis is control vs. simplicity: CrewAI deploys multi-agent setups 40% faster but adds ~18% token overhead compared to LangGraph [15]. Report all six dimensions separately; no composite score.

Common Pitfalls

Scoring final output only → misses which trajectory step regressed [7]
Relying solely on public benchmarks → misses your tool registry, error codes, business policy [7]
Reporting aggregate scores → 70% overall can hide 0% on two sub-environments [1]
Shared state between test runs → each trial needs a clean environment; correlated failures inflate pass rates [3]
Unlabelled demo mode → ambiguity between live and pre-recorded destroys credibility [8]
Inventing test cases → source tasks from actual bug trackers and real failures [3]
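
The shared-state pitfall is the easiest of these to rule out mechanically. A minimal sketch, assuming file-system state is the only state that can leak (databases, caches, and remote services need their own reset); `run_trial` and `example_task` are hypothetical names:

```python
import tempfile
from pathlib import Path

def run_trial(task, trial_id: int) -> bool:
    """Run one trial in a fresh scratch directory that is deleted afterwards."""
    with tempfile.TemporaryDirectory(prefix=f"trial_{trial_id}_") as workdir:
        return task(Path(workdir))

def example_task(workdir: Path) -> bool:
    """Stand-in task: passes only if no earlier trial left a marker behind."""
    marker = workdir / "done.txt"
    leaked = marker.exists()
    marker.write_text("ok")
    return not leaked

outcomes = [run_trial(example_task, i) for i in range(3)]
print(outcomes)
```

If the three trials shared one directory, the second and third would see the first trial's marker and their results would no longer be independent samples.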