Better-Harness: Using Evals to Iteratively Improve Agent Harnesses
Better-Harness uses evaluation-driven feedback loops to iteratively improve agent harnesses, so that gains on the eval suite translate into better generalization in production. It treats evals as training data for the agent: each test case provides a learning signal for optimizing prompts, tools, and workflows.

The system combines curated eval sourcing (hand-written cases, production traces, external datasets), structured tagging for behavioral coverage, and holdout sets that guard against overfitting. These pieces form a compound improvement loop of data sourcing, experiment design, optimization, and human review, applied continuously to refine agent performance. Key practices include mining production traces for failures, running tagged eval subsets for cost-efficient testing, and pairing automated improvements with human validation to catch reward hacking and ensure real-world reliability.
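The tagging and holdout ideas above can be made concrete with a small sketch. This is an illustrative data model, not Better-Harness's actual API: the `EvalCase` fields, `by_tag`, and `split_holdout` names are hypothetical, chosen to show how tagged subsets enable cheap targeted runs while a seeded holdout split keeps a slice of cases out of the optimization loop.

```python
import random
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One harness test case with behavioral tags (illustrative schema)."""
    prompt: str
    expected: str
    source: str                 # e.g. "hand_written", "production_trace", "external"
    tags: set = field(default_factory=set)

def by_tag(cases, tag):
    """Select the cost-efficient subset that exercises one behavior."""
    return [c for c in cases if tag in c.tags]

def split_holdout(cases, holdout_frac=0.2, seed=0):
    """Reserve a holdout slice so the optimization loop never trains on it."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)   # seeded: the split is reproducible
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]
```

During iteration you would run only `by_tag(cases, "tool_use")`-style subsets for fast, cheap feedback, and score the full holdout slice only when deciding whether to keep a change.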
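The compound loop (optimize, then validate before accepting) can also be sketched. Everything here is an assumption for illustration: `score` and `propose_fix` are hypothetical callbacks standing in for an eval runner and a prompt/tool optimizer, and the accept-only-if-holdout-holds rule is one simple way to reject changes that overfit or reward-hack the dev cases.

```python
def improve_harness(harness, train_cases, holdout_cases, score, propose_fix, rounds=3):
    """Eval-driven loop: find failures on the dev cases, propose a fix,
    and accept it only if the holdout score does not regress.

    `score(harness, cases)` returns a pass rate in [0, 1];
    `propose_fix(harness, failures)` returns a modified harness
    (e.g. a revised prompt). Both are assumed callbacks.
    """
    best_holdout = score(harness, holdout_cases)
    for _ in range(rounds):
        failures = [c for c in train_cases if score(harness, [c]) < 1.0]
        if not failures:
            break
        candidate = propose_fix(harness, failures)
        if score(candidate, holdout_cases) >= best_holdout:
            harness = candidate
            best_holdout = score(harness, holdout_cases)
        # otherwise reject: the change did not generalize (possible reward hacking),
        # which is exactly the case a human reviewer should inspect
    return harness
```

In practice the reject branch is where human review plugs in: an automated "improvement" that lifts dev scores but not holdout scores is a candidate for reward hacking, not a win.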