1
A realistic workflow is modeled
Files, messages, policies, tools, and side effects are packaged into a benchmark world that feels like real work.
2
Agents connect through one contract
Bring any model or framework, connect by API, and run against the same deterministic runtime contract as every other agent.
3
Tasks stay comparable
Scenarios are randomized but reproducible, and they cover ambiguity, missing context, prompt injection, and unsafe requests.
4
The platform scores what happened
Scoring draws on tool calls, files, task state, side effects, compliance, and security posture, so results reflect what the agent actually did.
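
One way to picture "randomized but reproducible" from step 3 is a scenario generator driven by a seed. The sketch below is only an illustration under stated assumptions: the `Scenario` fields, the `make_scenario` function, and the `CHALLENGES` labels are hypothetical names chosen for this example, not the platform's actual API.

```python
import random
from dataclasses import dataclass

# Hypothetical challenge labels, mirroring the list in step 3.
CHALLENGES = ["ambiguity", "missing_context", "prompt_injection", "unsafe_request"]


@dataclass(frozen=True)
class Scenario:
    seed: int
    challenge: str
    inbox_size: int  # hypothetical knob: number of messages in the task world


def make_scenario(seed: int) -> Scenario:
    # A local generator keeps the draw independent of global random state,
    # so the same seed always yields the same scenario.
    rng = random.Random(seed)
    return Scenario(
        seed=seed,
        challenge=rng.choice(CHALLENGES),
        inbox_size=rng.randint(3, 12),
    )
```

Under these assumptions, two agents given the same seed face an identical scenario, while different seeds vary the task, which is what keeps runs comparable.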