The Cited Sandbox: One Open-Weight Model, One Tool, Eight Gates

June 1, 2026

AI Engineer: Farid Temuri · GitHub

Challenge Source Code ↗

Run	Account	Points	Created
GeorgeDroid [xiaomi/mimo-v2.5-pro] (view run)	`9QjKSV`	`71.8`/100	1 mo ago

Editorial Note from Rinat

Let’s explore an experiment in solution architecture for Agentic Commerce by Farid Temuri.

This agent scored 71.8 on the ECOM1 Ultimate leaderboard (during the 3-hour blind run) and placed in the top 20. The most impressive part: it did so with a simple architecture and the open-weight xiaomi/mimo-v2.5-pro model under the hood.

The architecture involved these key blocks:

The LLM ran in a single agentic loop (REPL) with feedback.
The LLM had access to a sandbox where it could run code, read files, and use a scratchpad.
Results had to pass through submission gates, a set of deterministic checks.
Agent responses that failed a submission gate were handed back to the REPL loop with more context about the needed repair.

So, all together: one tool, no planner, no router, no LLM judge, no fine-tuning. The agent was allowed to write real code against the runtime instead of calling narrow tools. The harness also guarded the agent against submitting an answer that isn’t grounded in files it actually read.
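The gate-and-repair loop described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `Candidate`, `Gate`, `firstFailure`, `solveWithRepair`, the toy model, and the file paths are all hypothetical names introduced here, and the real loop is asynchronous and drives an actual LLM.

```typescript
// Hypothetical types: none of these names come from the original repository.
type Candidate = { answer: string; refs: string[] };
// A gate returns null when the candidate passes, or a repair hint when it fails.
type Gate = (c: Candidate) => string | null;

// Run the gates in order and return the first repair hint, if any.
function firstFailure(gates: Gate[], c: Candidate): string | null {
  for (const gate of gates) {
    const hint = gate(c);
    if (hint !== null) return hint;
  }
  return null;
}

// Ask the "model" (here: any function) for a candidate, feeding back the last hint.
function solveWithRepair(
  propose: (feedback: string | null) => Candidate,
  gates: Gate[],
  maxAttempts: number,
): { accepted: Candidate | null; attempts: number } {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = propose(feedback);
    feedback = firstFailure(gates, candidate);
    if (feedback === null) return { accepted: candidate, attempts: attempt };
  }
  return { accepted: null, attempts: maxAttempts };
}

// Toy usage: the first proposal cites a file that was never read.
const filesRead = new Set<string>(["/docs/policy.md"]); // hypothetical path
const groundedRefs: Gate = (c) => {
  const bad = c.refs.filter((r) => !filesRead.has(r));
  return bad.length > 0 ? `refs not read this trial: ${bad.join(", ")}` : null;
};
const toyModel = (feedback: string | null): Candidate =>
  feedback === null
    ? { answer: "max discount is 5%", refs: ["/docs/invented.md"] }
    : { answer: "max discount is 5%", refs: ["/docs/policy.md"] };

const repairResult = solveWithRepair(toyModel, [groundedRefs], 3);
console.log(repairResult.attempts, repairResult.accepted?.refs);
```

The point of the design is that a rejected answer is not a dead end: the gate's message becomes the next turn's input, so the model gets a concrete thing to fix.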

Thanks to this architecture, the agent avoided many of the no-answer and hallucinated-reference failures that are common among agents attempting ECOM1 tasks.

BitGN platform analysis shows that the architecture still had a few weak points: occasional missing or extra required refs, wrong cited SKU families, and trouble with OCR uploads, carts, and checkout-policy files. The agent studied the context before answering, but the evidence contract was not always aligned with what the business process expects.

The model under the hood is xiaomi/mimo-v2.5-pro:

1.02T-parameter Mixture-of-Experts model with 42B active parameters, built on a hybrid-attention architecture with a 1M-token context window.

The model is larger than usual, yet it can still run locally on a modest GPU cluster, thanks to the MoE architecture.

This agent would be a good baseline for the next ECOM2 round. It would benefit from:

filesystem-first navigation
task-specific citation checklists
strict outcome-class rules
exact answer-format gates

If those are added without bloating the stack, this style of agent could become a strong open-weight reference point for ECOM1-style work.

Congratulations to Farid Temuri for scoring top-20 on the blind bitgn/ecom1-prod leaderboard (run-22RxPyYQ4dtnsaeKdXpRsJ6ce) using one open-weight model (xiaomi/mimo-v2.5-pro) and a very straightforward architecture. Here is his GitHub repository with the source code (TypeScript / Bun).

Below you will find a summary of the agent, written by the team that created it: Farid and his Claude Agent.

How does it work?

The Cited Sandbox: one open-weight model drives a sandbox through one execute_script tool; deterministic gates reject any ungrounded answer

What starts a task? A benchmark-agnostic control plane runs the lifecycle (startRun -> per trial: startTrial -> runAgent -> endTrial -> submitRun -> poll for deferred scores). It contains zero task-solving logic; every task-solving idea changed only the per-trial runtime, never the control plane.
What context does the agent receive? On turn one it gets the workspace tree, preloaded /docs, and project hints, all assembled into one system prompt. A fresh runtime URL is issued per trial; no state leaks between trials.
Which tools or APIs can it call? Exactly one: execute_script. The model emits a single JSON object per turn: { current_state, plan_remaining_steps_brief, task_completed, code }, and only code runs. It executes as JavaScript in a Bun AsyncFunction sandbox with three injected locals: harness (the ECOM runtime client: tree / find / search / list / read / write / delete / stat / exec / answer), scratchpad (persistent working memory across turns), and console (captured and fed back next turn). harness.exec is a real shell into the runtime, so the model can grep, run SQL, and read JSON catalogues however it likes.
How does it inspect state before acting? It reads. A code sandbox lets one capable model express any lookup: join two JSON files, fall back from SQL to the filesystem, cross-check a policy addendum against its base, without me predicting each as a bespoke tool.
How does it decide a task is finished? It calls await harness.answer(scratchpad, verify). That passes through eight deterministic gates (below) before the answer is accepted. The step budget is bounded (MAX_PRIMARY_STEPS = 35, plus a +5 nudge); if the loop ever exits without answering, a finally block submits OUTCOME_ERR_INTERNAL so a trial never silently returns nothing.
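The per-turn contract and the bounded step budget can be illustrated with a small synchronous sketch. The field names, MAX_PRIMARY_STEPS, and OUTCOME_ERR_INTERNAL come from the description above; everything else (`parseTurn`, `runTrial`, `NUDGE_STEPS`, the validation rules, and the reading of the "+5 nudge" as five extra steps) is an assumption made for illustration. The real loop is asynchronous and runs the emitted code in a Bun sandbox.

```typescript
// Field names come from the write-up; the validation logic is an assumption.
interface Turn {
  current_state: string;
  plan_remaining_steps_brief: unknown; // exact shape is not specified in the write-up
  task_completed: boolean;
  code: string;
}

// Parse one model turn and reject objects that cannot be executed.
function parseTurn(raw: string): Turn {
  const obj: unknown = JSON.parse(raw);
  if (typeof obj !== "object" || obj === null) throw new Error("turn must be a JSON object");
  const t = obj as Record<string, unknown>;
  const code = t["code"];
  const done = t["task_completed"];
  if (typeof code !== "string") throw new Error("turn.code must be a string");
  if (typeof done !== "boolean") throw new Error("turn.task_completed must be a boolean");
  return {
    current_state: String(t["current_state"] ?? ""),
    plan_remaining_steps_brief: t["plan_remaining_steps_brief"],
    task_completed: done,
    code,
  };
}

const MAX_PRIMARY_STEPS = 35;
const NUDGE_STEPS = 5; // assumption: the "+5 nudge" grants five extra steps

// `step` returns true once an answer has been accepted by the gates.
function runTrial(
  step: (stepIndex: number, nudged: boolean) => boolean,
  submit: (outcome: string) => void,
): number {
  let answered = false;
  let steps = 0;
  try {
    while (!answered && steps < MAX_PRIMARY_STEPS + NUDGE_STEPS) {
      const nudged = steps >= MAX_PRIMARY_STEPS; // past the primary budget
      answered = step(steps, nudged);
      steps++;
    }
  } finally {
    // Runs on budget exhaustion and on thrown errors alike.
    if (!answered) submit("OUTCOME_ERR_INTERNAL");
  }
  return steps;
}

// Usage: a model that never answers exhausts the budget and triggers the fallback.
const submitted: string[] = [];
const stepsUsed = runTrial(() => false, (outcome) => { submitted.push(outcome); });
const turn = parseTurn(
  '{"current_state":"start","plan_remaining_steps_brief":["read policy"],"task_completed":false,"code":"console.log(1)"}',
);
console.log(stepsUsed, submitted, turn.code);
```

Putting the fallback in `finally` rather than after the loop means an exception inside a step still produces a submission before the error propagates.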

Models

Main solver: xiaomi/mimo-v2.5-pro (open-weight), served via OpenRouter.
Classifier/router/planner, if any: none.
Evaluator or evolution loop, if any: none. I ran a pre-submission LLM judge for a while and deleted it (see Problems/Solutions).
Runtime settings that mattered: REASONING_EFFORT=low. Across a flag-bisection sweep on the dev set, low scored as well as or better than medium, and higher effort sometimes hurt. The leaderboard run is low.
Were all listed models open-weight/local? Yes, the only model in the stack is open-weight, so the architecture is open-weights eligible end-to-end.

E-commerce OS Reasoning

Catalogue and product matching: the model queries the catalogue (SQL or filesystem JSON) and matches on product attributes rather than fuzzy name guesses; “base model” vs. a specific variant is resolved from attributes, not the string.
Inventory, warehouses, shipping, store coverage: inventory is read with explicit on_hand / available_today / incoming semantics; a request blocked because requested qty exceeds available_today is a state limit, not a security refusal.
Customer records, baskets, orders, payments: ownership is established by reading the owning record and comparing customer_id against the actor from /bin/id, never inferred from an empty query.
Merchant policies and policy addenda: the agent consults policy documents and their addenda before acting; a discount above the policy max is refused regardless of who asks.
Support tickets, returns, refunds, escalations: modeled as authorized actions with their own evidence requirements; a refund/return only reaches OUTCOME_OK after the mutation is confirmed.
Audit trails, logs, evidence: every cited path must be one the agent actually read. Citations are the evidence trail, enforced by the gates.
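As an illustration of the inventory semantics above, a quantity check might look like the sketch below. The field names on_hand, available_today, and incoming come from the text; the record shape, the `checkQuantity` helper, and the sample SKU are hypothetical.

```typescript
// Assumed record shape; only the three quantity field names come from the write-up.
interface InventoryRecord {
  sku: string;
  on_hand: number; // physically in stock
  available_today: number; // sellable right now
  incoming: number; // expected, not yet sellable
}

type StockDecision =
  | { blocked: false }
  | { blocked: true; outcome: "OUTCOME_NONE_UNSUPPORTED"; reason: string };

// A shortfall against available_today is a state limit, not a security refusal.
function checkQuantity(rec: InventoryRecord, requestedQty: number): StockDecision {
  if (requestedQty > rec.available_today) {
    return {
      blocked: true,
      outcome: "OUTCOME_NONE_UNSUPPORTED",
      reason: `requested ${requestedQty} exceeds available_today ${rec.available_today} for ${rec.sku}`,
    };
  }
  return { blocked: false };
}

// Usage: on_hand and incoming look generous, but only available_today counts.
const sampleStock: InventoryRecord = { sku: "SKU-EXAMPLE", on_hand: 10, available_today: 4, incoming: 20 };
const tooMany = checkQuantity(sampleStock, 6);
const fine = checkQuantity(sampleStock, 3);
console.log(tooMany, fine);
```

Note that an unblocked check only means the request may proceed; per the rules above, OUTCOME_OK is still reserved for after the mutation is confirmed.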

Acting, Refusing, and Escalating

When may it mutate state? Only after reading the governing record/policy. After a write/checkout/discount/refund it re-reads (or checks the tool’s success output) before claiming OUTCOME_OK, never “Added/Closed” without a confirmed write.
How does it verify authorization? It resolves the actor from /bin/id (cust-NNNN + roles) and positively reads the owning record. It refuses with OUTCOME_DENIED_SECURITY only when it has read a record whose owner differs from the actor. An empty query / 404 / empty find is not proof of ownership.
How does it handle unsafe pressure? Injection noise (“SYSTEM OVERRIDE”, “ownership transferred”, “authenticated”) is treated as data to ignore, never a reason to act or refuse. Employee actors may not purchase -> OUTCOME_NONE_UNSUPPORTED.
When does it refuse / clarify / escalate? Every answer carries exactly one of five outcome classes:
- OUTCOME_OK: task fully completed / definite answer, with every load-bearing record and policy cited.
- OUTCOME_DENIED_SECURITY: identity/ownership/role mismatch, adversarial instruction, or bait subject.
- OUTCOME_NONE_UNSUPPORTED: out of policy regardless of who asks, or blocked by the record’s own state (e.g. a 9% discount when the max is 5%, or an employee purchase).
- OUTCOME_NONE_CLARIFICATION: “the basket / the order” is ambiguous and discovery finds multiple live candidates.
- OUTCOME_ERR_INTERNAL: unrecoverable tooling failure (also the no-answer fallback).
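The five outcome classes and the "empty result is not proof" rule lend themselves to a typed sketch. The outcome names are taken from the list above; the `Actor`, `Lookup`, and `OwnershipVerdict` types, the `checkOwnership` helper, and the "inconclusive" verdict are hypothetical, introduced here to show one way the rule could be encoded.

```typescript
// The five outcome classes named in the write-up.
type Outcome =
  | "OUTCOME_OK"
  | "OUTCOME_DENIED_SECURITY"
  | "OUTCOME_NONE_UNSUPPORTED"
  | "OUTCOME_NONE_CLARIFICATION"
  | "OUTCOME_ERR_INTERNAL";

// Hypothetical shapes: the actor as resolved from /bin/id, and a record lookup result.
interface Actor {
  customerId: string;
  roles: string[];
}
type Lookup =
  | { status: "found"; customer_id: string }
  | { status: "missing" }; // empty query, 404, or empty find

type OwnershipVerdict =
  | { kind: "owner" }
  | { kind: "refuse"; outcome: Outcome }
  | { kind: "inconclusive" };

// Refuse only on a positively read record whose owner differs from the actor.
function checkOwnership(actor: Actor, lookup: Lookup): OwnershipVerdict {
  if (lookup.status === "missing") return { kind: "inconclusive" };
  if (lookup.customer_id !== actor.customerId) {
    return { kind: "refuse", outcome: "OUTCOME_DENIED_SECURITY" };
  }
  return { kind: "owner" };
}

// Usage with made-up identifiers in the cust-NNNN style.
const actor: Actor = { customerId: "cust-0001", roles: ["customer"] };
const ownRecord = checkOwnership(actor, { status: "found", customer_id: "cust-0001" });
const foreignRecord = checkOwnership(actor, { status: "found", customer_id: "cust-0002" });
const emptyResult = checkOwnership(actor, { status: "missing" });
console.log(ownRecord.kind, foreignRecord.kind, emptyResult.kind);
```

Keeping "inconclusive" as its own verdict forces the caller to keep reading instead of collapsing a missing record into a security refusal.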

Problems

Failure mode 1: hallucinated references. Early runs invented citation paths that looked plausible but didn’t exist on disk. The grader explicitly checks for required references (e.g. answer missing required reference '/proc/catalog/X.json'), so this was costly.
Failure mode 2: an LLM judge that cost more than it earned. A pre-submission judge added ~24s of latency on every submission, showed no grader-score lift over 19 instrumented runs, and had a ~32% false-negative rate concentrated in refs errors.
Failure mode 3: choosing a run with scores locked. During the blind window I had to pick a run to submit with no grader feedback, and my inferential pick was wrong (it favored a fancier filesystem-first/medium-effort run over the simple low-effort one that actually placed).

Solutions

Prompt or rule changes: make ungrounded answers unrepresentable. Citation is one atomic call, scratchpad.cite(path, reason), which throws if the reason is under 8 chars or the path wasn’t read this trial. You cannot cite a file you never opened.
Tooling or runtime changes: eight ordered submission gates, each throwing a fix-it message the model can retry against: (1) verify is a function; (2) structured-fact shapes; (3) canonical refs_why with at least 8-char reasons; (4) refs must be a subset of what was actually read; (5) per-ref justification; (6) outcome is one of the five classes; (7) the agent’s own verify(sp); and (8) a deterministic check that every declared literal token appears verbatim in the answer.
Evaluation/debugging changes: total observability. Every run writes runs/<runId>.jsonl with the full system prompt, initial scratchpad, and per step the code, output, full reasoning, token counts, a deep scratchpad snapshot, and the grader’s exact complaints. Every claim here was read out of those logs.
Things kept deliberately simple: one model, one tool, no judge. Removing the judge made the agent faster and no less accurate.
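To make the citation rule and a few of the gates concrete, here is a minimal sketch. Only `cite(path, reason)`, the 8-character minimum, the read-this-trial requirement, and the five outcome names come from the text; the `Scratchpad` class layout, `markRead`, `wasRead`, the `Submission` shape, `runGates`, and the file paths are hypothetical, and only three of the eight gates are shown.

```typescript
const OUTCOMES: readonly string[] = [
  "OUTCOME_OK",
  "OUTCOME_DENIED_SECURITY",
  "OUTCOME_NONE_UNSUPPORTED",
  "OUTCOME_NONE_CLARIFICATION",
  "OUTCOME_ERR_INTERNAL",
];

class Scratchpad {
  private readonly readPaths = new Set<string>();
  readonly refs = new Map<string, string>(); // path -> reason

  // Hypothetical hook: assume the harness calls this whenever a file is read.
  markRead(path: string): void {
    this.readPaths.add(path);
  }

  wasRead(path: string): boolean {
    return this.readPaths.has(path);
  }

  // Citation is atomic: it fails unless the path was read and the reason is substantive.
  cite(path: string, reason: string): void {
    if (reason.length < 8) throw new Error(`cite: reason for ${path} must be at least 8 characters`);
    if (!this.readPaths.has(path)) throw new Error(`cite: ${path} was not read this trial`);
    this.refs.set(path, reason);
  }
}

// Assumed submission shape for this sketch.
interface Submission {
  outcome: string;
  answer: string;
  refs: string[];
  literals: string[]; // tokens the answer must contain verbatim
}

// Three illustrative gates; each throws a message the model can act on.
function runGates(sp: Scratchpad, s: Submission): void {
  if (!OUTCOMES.includes(s.outcome)) {
    throw new Error(`outcome must be one of: ${OUTCOMES.join(", ")}`);
  }
  const unread = s.refs.filter((r) => !sp.wasRead(r));
  if (unread.length > 0) throw new Error(`refs not read this trial: ${unread.join(", ")}`);
  const missing = s.literals.filter((tok) => !s.answer.includes(tok));
  if (missing.length > 0) throw new Error(`answer is missing literal tokens: ${missing.join(", ")}`);
}

// Usage with made-up paths.
const sp = new Scratchpad();
sp.markRead("/docs/discount-policy.md");
sp.cite("/docs/discount-policy.md", "states the maximum discount");
const submission: Submission = {
  outcome: "OUTCOME_NONE_UNSUPPORTED",
  answer: "Refused: 9% exceeds the 5% maximum.",
  refs: ["/docs/discount-policy.md"],
  literals: ["9%", "5%"],
};
runGates(sp, submission); // passes without throwing

try {
  sp.cite("/docs/never-opened.md", "looks relevant enough");
} catch (err) {
  console.log((err as Error).message);
}
```

Because `cite` throws at the moment of citation, a bad reference surfaces in the turn that produced it rather than at submission time, which keeps the repair message close to its cause.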

What Would You Improve Next?

Land and A/B a <navigation-hardening> prompt block (real SQL schema, attribute matching, inventory semantics) that the champion run predates; early evidence suggests it removes dead-SQL step-waste.
Capture a clean filesystem-first + low-effort run. The existing one was damaged by an over-aggressive concurrency setting, so the true optimum is likely still unmeasured.
Tighten the answer-format gate with task-derived token extraction so the model needs less manual bookkeeping.

Lessons From ECOM1

Grounding beats cleverness. The biggest single win was making ungrounded answers impossible to submit.
One capable model + a code sandbox beats an orchestra of narrow tools when task shapes vary this much.
Refusals are first-class. Treat NONE_UNSUPPORTED / NONE_CLARIFICATION / DENIED_SECURITY as real targets with their own evidence requirements; “empty result ≠ absence.”
Measure your knobs. Low reasoning effort winning was counterintuitive and only visible because of the bisection sweep.
Trust measurement over inference. When my locked-score guess disagreed with the measured dev prior, the measured prior was right.
Delete components that don’t pay. The judge was the clearest example.

Questions, or want a walkthrough of any part of this? Find me on GitHub. Happy to compare notes with other ECOM1 authors.
