Score by Model (higher is better)
Cost per Run (USD, lower is better)
Median Wall Clock (s, lower is better)
Output TPS (higher is better)
Tool Calls by Model (lower is better)
Output Tokens by Model (contextual)
Input Tokens by Model (lower is better)
Judge Disagreement (σ, lower is better)