Benchmarks
Replay historical objectives against a new prompt or model to measure regressions.
What it does#
Pick a curated set of past objectives (the "golden set"), point a new agent prompt or model at them, and Mitosis re-runs them in parallel. You get a side-by-side comparison: success rate, cost, latency, output diff.
Where#
/benchmarks. Run before promoting an agent change to production.