Benchmarks

Replay historical objectives against a new prompt or model to measure regressions.

What it does#

Pick a curated set of past objectives (the "golden set"), point a new agent prompt or model at them, and Mitosis re-runs them in parallel. You get a side-by-side comparison: success rate, cost, latency, output diff.

Where#

/benchmarks. Run before promoting an agent change to production.