A comprehensive guide to LLM evals, drawn from questions asked in our popular course on AI Evals. Covers everything from basic to advanced topics.
Related Articles
5 Best AI Agent Observability Tools for Agent Reliability in 2026
A critical review of the top 2026 platforms (Braintrust, Vellum, Fiddler, Helicone, Galileo). It highlights the move toward 'time-travel debugging' and toward integrating automated evaluations directly into the production trace pipeline.
Choosing and Operating Tabular Models Inside AI Agents
AI agents that make decisions over structured data rely heavily on tabular learning models, but model choice has direct implications for agent reliability, routing, and operational behavior. In this benchmark, we evaluate 7 widely used tabular model families across 19 real-world datasets (~260k rows, 250+ features) to understand which models agents should invoke under different data regimes. Rather than focusing solely on average rank, we analyze win rates to capture dominance, a critical signal when agents must choose models dynamically at runtime. The results reveal that:

- Foundation models are most effective for agents operating with limited data
- XGBoost is the most reliable choice for large, numeric-heavy workloads
- Hybrid datasets at scale remain operationally ambiguous, with multiple viable model choices

These findings highlight a core Agent Ops challenge: model selection and routing inside agents is a runtime decision, not a one-time architecture choice. As agents increasingly combine LLM reasoning with structured prediction, understanding the operational strengths and failure modes of tabular models becomes essential for building robust, cost-aware agent systems.
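The runtime-routing idea above can be sketched as a small dispatch function. This is a minimal illustration only: the thresholds, the `DatasetProfile` fields, and the model-family labels are assumptions for the sketch, not values taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    n_rows: int
    numeric_fraction: float  # share of features that are numeric (0.0-1.0)


def route_model(profile: DatasetProfile) -> str:
    """Pick a tabular model family at runtime based on the data regime.

    Thresholds are illustrative assumptions, not benchmark results.
    """
    if profile.n_rows < 10_000:
        # Limited-data regime: foundation models tend to win here.
        return "foundation-model"
    if profile.numeric_fraction > 0.8:
        # Large, numeric-heavy regime: gradient boosting is the reliable default.
        return "xgboost"
    # Hybrid data at scale is ambiguous: keep several candidates in play.
    return "ensemble-vote"


print(route_model(DatasetProfile(n_rows=5_000, numeric_fraction=0.5)))
print(route_model(DatasetProfile(n_rows=200_000, numeric_fraction=0.95)))
```

Framing the choice as a function of a dataset profile, rather than a fixed architecture decision, is what lets an agent re-route as the data it sees at runtime changes.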
Comparing RAG Evaluation Tools
RAG systems experience failures caused by retrieval poisoning. This analysis evaluates six RAG evaluation frameworks on their ability to detect deceptive negatives, focusing on relevance scoring, ranking metrics, adversarial safety, and how evaluation tooling and prompt design affect agent reliability.