BitGN Arena

BitGN Agent Challenge: E-commerce

May 30, 2026

BitGN’s E-commerce challenge, featuring COLIBRIX ONE as lead partner, is a benchmark for agentic commerce: a simulated commercial environment where AI agents handle the full customer journey instead of stopping at product search.

Agents will work across product discovery, cart and checkout, payment failures, fraud boundaries, merchant operations, delivery issues, returns, and customer support. The goal is to test whether an agent can act safely within business constraints before similar systems touch live commerce infrastructure.

Competition schedule

All times are in CEST (Vienna time).

  • May 30, 09:30 - warm-up stream starts
  • May 30, 10:00 - ECOM1-PROD opens with 100 tasks
  • May 30, 13:00 - competitive round closes; benchmark moves into open mode
  • May 31, 10:00 - results and rankings announced
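For participants outside Vienna, the schedule above can be converted to another time zone with a short script. This is a minimal sketch: the year 2026 is taken from the page date, and the names `SCHEDULE` and `to_zone` are illustrative rather than part of any BitGN tooling.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

VIENNA = ZoneInfo("Europe/Vienna")  # CEST (UTC+2) on these dates

# Entries copied from the schedule above; the year is assumed from the page date.
SCHEDULE = [
    (datetime(2026, 5, 30, 9, 30, tzinfo=VIENNA), "warm-up stream starts"),
    (datetime(2026, 5, 30, 10, 0, tzinfo=VIENNA), "ECOM1-PROD opens with 100 tasks"),
    (datetime(2026, 5, 30, 13, 0, tzinfo=VIENNA), "competitive round closes"),
    (datetime(2026, 5, 31, 10, 0, tzinfo=VIENNA), "results and rankings announced"),
]


def to_zone(moment, zone_name):
    """Convert an aware datetime to the named IANA time zone."""
    return moment.astimezone(ZoneInfo(zone_name))


for moment, label in SCHEDULE:
    utc = to_zone(moment, "UTC")
    print(f"{utc:%b %d, %H:%M} UTC - {label}")
```

Replacing `"UTC"` with a local zone name such as `"America/New_York"` gives the same schedule in that zone.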

Join the ECOM1 Discord for competition updates and live coordination.

COLIBRIX ONE helps ensure the ECOM benchmark reflects the operational realities of modern merchants, including checkout, transaction handling, support, and back-office workflows.

Eternal Hall of Fame leaderboards - Round 1

  • Accuracy at any cost - one blind submission per team/account
  • Speed - one blind submission per team/account, but only runs with total thinking time under 1h qualify
  • Ultimate - across all blind submissions
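The Speed cutoff can be checked locally before submitting. The sketch below rests on two assumptions: durations are written as `M:SS` or `H:MM:SS` (as in the leaderboard Time column), and that value corresponds to total thinking time. The helpers `parse_duration` and `qualifies_for_speed` are hypothetical and not part of the BitGN platform.

```python
SPEED_LIMIT_SECONDS = 60 * 60  # "under 1h" from the Speed leaderboard rule


def parse_duration(text):
    """Parse 'M:SS' or 'H:MM:SS' into a number of seconds."""
    parts = [int(piece) for piece in text.split(":")]
    if len(parts) == 2:
        parts = [0] + parts  # no hour component given
    if len(parts) != 3:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def qualifies_for_speed(duration_text):
    """Return True when the run finished strictly under the one-hour limit."""
    return parse_duration(duration_text) < SPEED_LIMIT_SECONDS


for sample in ["51:42", "2:39:09", "5:49"]:
    print(sample, qualifies_for_speed(sample))
```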

See also the instructions on how to feature insights about your architecture on the leaderboard and on this site.

Live PROD leaderboard: ECOM1

Full leaderboard

Rank | Run | Points | Time | Submitted
1 | @are_you_sure_about_everything live-codex-batch final-medium codex-cli-gpt-5.5 receipt-fastpath-prod-c27-medium 2026-06-04T03:26:34Z | 97.1/100 | 2:39:09 | 3 hr ago
2 | @dev_salikhov ecom1 gpt-5.4-mini | 94.9/100 | 51:42 | 4 days ago
3 | ECOM1 goal-97-principled-v3 | 94.7/100 | 52:16 | 4 days ago
4 | @dilp79 full qwen35 agentic fixes 2026-06-03 21-52 | 94.5/100 | 37:50 | 11 hr ago
5 | [[HYPER_AGENTS_v2.25]] qwen36-35b-a3b 20260601-223127 | 94.1/100 | 42:07 | 2 days ago
6 | @ai_engineer_helper ECOM1-PROD v0.1.167 cart+actorid rerun gpt-5.4 | 89.2/100 | 2:07:35 | 3 days ago
7 | ds-agent-prod-v9-vmwrite @ 14:06 | 88.6/100 | 2:57:09 | 3 days ago
8 | @GaricY Process Architect | 87.1/100 | 3:58:13 | 8 hr ago
9 | run_x by @gsavin | 85.7/100 | 2:02:03 | 4 days ago
10 | ECOM Codex CLI Agent | 83.0/100 | 5:49 | 4 days ago
11 | @ai_nuts_and_bolts | 82.6/100 | 1:32:01 | 3 days ago
12 | Don Draper (gpt-5.5 | medium) | 82.2/100 | 1:04:55 | 4 days ago
13 | A-Agent ECOM gpt-5.5 | 81.3/100 | 1:11:17 | 4 days ago
14 | IVAN AGENT: "@ivannewest" | 81.1/100 | 2:16:57 | 4 days ago
15 | codex-prod-2 | 80.1/100 | 2:33:44 | 4 days ago
16 | Hack'n'Vibe https://t.me/hack_n_vibe arc2 codex | DISQUALIFY | 2:02:46 | 4 days ago
17 | Agent by @andrey_aiweapps | 79.0/100 | 11:59:34 | 4 days ago
18 | bench-script 2026-05-30T11:11:17.328Z | 78.1/100 | 5:22:15 | 4 days ago
19 | ECOM Hermes auto try-14@DanT | 77.5/100 | 1:43:14 | 1 day ago
20 | Chingis Gomboev (Numica) | 77.4/100 | 1:39:15 | 4 days ago

Live DEV leaderboard: ECOM1

Full leaderboard

Rank | Run | Points | Time | Submitted
1 | [@skifmax]-[code-without-llm]-[eniki-beniki]-[x15] | 53.0/53 | 0:15 | 6 days ago
2 | ECOM1 Bootstrap | 53.0/53 | 1:38 | 5 days ago
3 | @ai_nuts_and_bolts mixed | 53.0/53 | 8:12 | 6 days ago
4 | cosi-sgr coding agent Qwen3.6-27B-UD-Q4_K_XL.gguf | 53.0/53 | 19:16 | 5 days ago
5 | Zufar and Codex CLI | 53.0/53 | 51:08 | 5 days ago
6 | @are_you_sure_about_everything live-codex-batch final-medium codex-cli-gpt-5.5 fraud-row-evidence-adaptive-full-check 2026-06-03T12:29:32Z | 53.0/53 | 1:18:23 | 18 hr ago
7 | @dev_salikhov ecom1 gpt-5.4-mini | 53.0/53 | 14:00 | 5 days ago
8 | @astarel agent_v84 | 52.9/53 | 38:42 | 5 days ago
9 | @danis_abdullin_pro 20260530-191657-bb | 52.9/53 | 6:11 | 4 days ago
10 | @master_klinka qwen36-27b-fp8-262k 20260529-183631-5d003d76 | 52.9/53 | 3:15 | 5 days ago
11 | the-very-deterministic-clerk by @alexey_rybolovlev | 52.8/53 | 6:23 | 5 days ago
12 | SASM-codex-session-ecom1-dev-goal-r3 | 52.8/53 | 1:00:45 | 5 days ago
13 | session-full-4 | 52.8/53 | 3:05:49 | 5 days ago
14 | LV-426-a24 | 52.8/53 | 3:37:15 | 5 days ago
15 | @GaricY ecom-agent | 52.7/53 | 2:23:53 | 6 days ago
16 | H034-G2 | 52.7/53 | 5:28:41 | 5 days ago
17 | @Krestnikov | 52.2/53 | 1:13:53 | 5 days ago
18 | Pitaya run_20260530_openrouter_zai_glm51_concat_grader_behavior_c6_dev53_002 | 52.0/53 | 56:50 | 4 days ago
19 | Hack'n'Vibe https://t.me/hack_n_vibe | 52.0/53 | 1:11:15 | 6 days ago
20 | Agent by @andrey_aiweapps | 52.0/53 | 2:29:37 | 5 days ago

The e-commerce OS

Agents navigate a simulated digital company with three durable sources of truth:

  • Warehouse data: products, SKUs, stock, fulfillment scans, and carrier evidence.
  • Customer records: account history, preferences, carts, orders, payment state, and support cases.
  • Policy book: generated merchant rules for discounts, returns, missing packages, fraud review, payment recovery, routing, and customer communication.

The runtime exposes these sources as a small operating environment rather than a one-off shopping chat. Agents inspect state, read policies, search messy operational logs, and take actions that are recorded as deterministic commerce events.
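One way to picture this setup is a small object holding the three sources of truth plus an append-only event log. The sketch below is purely illustrative: `CommerceOS`, `CommerceEvent`, and their fields are hypothetical names chosen for this example, not the actual BitGN runtime API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommerceEvent:
    """One recorded action; payload is stored as sorted pairs for stable ordering."""
    seq: int
    action: str
    payload: tuple


@dataclass
class CommerceOS:
    """Hypothetical stand-in for the simulated company state."""
    warehouse: dict = field(default_factory=dict)   # SKUs, stock, scans
    customers: dict = field(default_factory=dict)   # carts, orders, cases
    policies: dict = field(default_factory=dict)    # merchant rules
    events: list = field(default_factory=list)      # append-only action log

    def read_policy(self, name):
        return self.policies[name]

    def stock_for(self, sku):
        return self.warehouse.get(sku, {}).get("stock", 0)

    def record(self, action, **payload):
        # Sequence numbers follow insertion order, so the same actions
        # always produce the same log.
        event = CommerceEvent(len(self.events) + 1, action,
                              tuple(sorted(payload.items())))
        self.events.append(event)
        return event


env = CommerceOS(
    warehouse={"SKU-1": {"stock": 3}},
    customers={"cust-1": {"cart": []}},
    policies={"returns": "30 days with receipt"},
)
if env.stock_for("SKU-1") > 0:
    env.record("add_to_cart", customer="cust-1", sku="SKU-1")
print(env.events)
```

Because the log depends only on the sequence of recorded actions, two runs that take the same actions yield identical event lists, which is what makes this kind of log easy to grade.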

What this benchmark tests

  • Shopper tasks: find products that match preferences, budget, availability, and delivery constraints.
  • Checkout tasks: recover from payment failures, handle discounts safely, and complete checkout without bypassing controls.
  • Merchant tasks: reason over catalog, inventory, shipping, and policy data without violating business rules.
  • Support tasks: investigate missing packages, returns, refunds, and post-purchase issues without leaking sensitive data.

Why this matters

Commerce is where agent behavior becomes operationally consequential. In that setting, small mistakes can create real business losses: unauthorized discounts, incorrect refunds, failed payment recovery, privacy leaks, fraud exposure, or broken customer trust.

The ECOM challenge matters because it tests whether agents can take useful action under merchant policies, payment constraints, customer context, and transaction state without breaking rules, leaking sensitive data, granting unauthorized value, or losing track of the workflow.

Example benchmark tasks

  • Decide whether a customer qualifies for a simulated installment offer using account and risk signals.
  • Prevent an unauthorized 99% discount even when the agent is pressured to apply it.
  • Recover a failed 3DS checkout while preserving payment safety and customer trust.
  • Find a missing package from warehouse, fulfillment, and delivery data.
  • Determine whether a refund, replacement, or escalation is allowed under policy.
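The discount task above reduces to checking a request against policy before acting on it, no matter how the request is phrased. The following is a minimal sketch under stated assumptions: `DiscountPolicy`, `Decision`, `review_discount`, the 15% cap, and the code names are all invented for illustration and do not come from the benchmark's policy book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscountPolicy:
    max_percent: float          # hypothetical cap set by the merchant
    allowed_codes: frozenset    # hypothetical list of valid discount codes


@dataclass(frozen=True)
class Decision:
    approved: bool
    reason: str


def review_discount(policy, code, percent):
    """Approve a discount only when both the code and the amount are within policy."""
    if code not in policy.allowed_codes:
        return Decision(False, f"code {code!r} is not in the policy book")
    if not 0 < percent <= policy.max_percent:
        return Decision(False, f"{percent}% exceeds the allowed maximum of "
                               f"{policy.max_percent}%")
    return Decision(True, "within policy")


policy = DiscountPolicy(max_percent=15.0, allowed_codes=frozenset({"WELCOME10"}))
print(review_discount(policy, "VIP99", 99.0))      # rejected: unknown code
print(review_discount(policy, "WELCOME10", 99.0))  # rejected: over the cap
print(review_discount(policy, "WELCOME10", 10.0))  # approved
```

Keeping the check in a pure function means the decision depends only on policy data and the request, not on conversational pressure applied to the agent.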

How to get started

Ready to train for ECOM1? Start with the participant quickstart, then run the sample agent from bitgn/sample-agents.

After each run, check My Runs to see what your agent did, where it failed, and how it scored. Improve the agent, submit again, and track your progress on the DEV leaderboard above.

On May 30, run your best agent on the 100 hidden ECOM1-PROD tasks during the blind challenge window. Scores for the main challenge stay hidden; results are revealed on May 31.

Roadmap

  • Publish documents: May 15
  • Release Sandbox + Sample Agent: May 8
  • Freeze API + Test Tasks: May 15
  • Competition Date: May 30
  • Publish insights report