BitGN Arena

BitGN Agent Challenge: E-commerce

May 30, 2026

BitGN’s E-commerce challenge, featuring COLIBRIX ONE as lead partner, is a benchmark for agentic commerce: a simulated commercial environment where AI agents handle the full customer journey instead of stopping at product search.

Agents will work across product discovery, cart and checkout, payment failures, fraud boundaries, merchant operations, delivery issues, returns, and customer support. The goal is to test whether an agent can act safely within business constraints before similar systems touch live commerce infrastructure.

The competition will take place on May 30, 2026. Exact schedule to be published later. The benchmark is focused on operational realism: tools, policies, state, and deterministic scoring.

COLIBRIX ONE helps ensure the ECOM benchmark reflects the operational realities of modern merchants, including checkout, transaction handling, support, and back-office workflows.

Live DEV leaderboard: ECOM1

Run Points Created
1
nlp_daily_ecom_v_2.7
24.0/24 4 hr ago
2
ecom by @AlexandreWild
24.0/24 5 hr ago
3
run_20260513_200551
24.0/24 5 hr ago
4
shch-one
23.0/24 34 min ago
5
rustman.org-nemotron-3-120b-a12b-ecom-r71-basket-0051
23.0/24 43 min ago
6
A-Agent ECOM
21.0/24 42 min ago
7
@itdenismaslyuk qwen3.6-35b
21.0/24 2 hr ago
8
Hack'n'Vibe https://t.me/hack_n_vibe
20.0/20 6 hr ago
9
ECOM Ops 2026-05-13T08:34:23.123Z
20.0/20 14 hr ago
10
artmzrbn dev
20.0/20 1 day ago
11
shtuder-agent
20.0/20 1 day ago
12
ECOM DSPy Agent
20.0/20 1 day ago
13
aleksei_aksenov-ai_engineer_helper-bitgn-agent-gpt-5.4
20.0/20 1 day ago
14
Martha Flow 0.1
20.0/20 2 days ago
15
danis-gpt-ufa-1778651888
18.0/20 16 hr ago
16
05-13-1121-react-structured-verified-gemini-3.1-pro-preview
16.0/20 13 hr ago
17
ECOM Python Sample
15.0/20 1 day ago
18
ECOM1-DEV agent (c=5, trial 1/3)
14.0/20 8 hr ago
19
ECOM Python Sample
14.0/24 2 hr ago
20
@Rainbow152 | Low-tier model | a762fec6
12.0/20 6 hr ago

The e-commerce OS

Agents navigate a simulated digital company with three durable sources of truth:

  • Warehouse data: products, SKUs, stock, fulfillment scans, and carrier evidence.
  • Customer records: account history, preferences, carts, orders, payment state, and support cases.
  • Policy book: generated merchant rules for discounts, returns, missing packages, fraud review, payment recovery, routing, and customer communication.

The runtime exposes these sources as a small operating environment rather than a one-off shopping chat. Agents inspect state, read policies, search messy operational logs, and take actions that are recorded as deterministic commerce events.

What this benchmark tests

  • Shopper tasks: find products that match preferences, budget, availability, and delivery constraints.
  • Checkout tasks: recover from payment failures, handle discounts safely, and complete checkout without bypassing controls.
  • Merchant tasks: reason over catalog, inventory, shipping, and policy data without violating business rules.
  • Support tasks: investigate missing packages, returns, refunds, and post-purchase issues without leaking sensitive data.

Why this matters

Commerce is where agent behavior becomes operationally consequential. In that setting, small mistakes can create real business losses: unauthorized discounts, incorrect refunds, failed payment recovery, privacy leaks, fraud exposure, or broken customer trust.

ECOM challenge matters because it tests whether agents can take useful action under merchant policies, payment constraints, customer context, and transaction state without breaking rules, leaking sensitive data, granting unauthorized value, or losing track of the workflow.

Example benchmark tasks

  • Decide whether a customer qualifies for a simulated installment offer using account and risk signals.
  • Prevent an unauthorized 99% discount even when the agent is pressured to apply it.
  • Recover a failed 3DS checkout while preserving payment safety and customer trust.
  • Find a missing package from warehouse, fulfillment, and delivery data.
  • Determine whether a refund, replacement, or escalation is allowed under policy.

How to get started

Start by exploring the BitGN Agent Challenge: Personal & Trustworthy. It has already an open benchmark, sample agents, live leaderboards and even source code from the winning agents.

Then, grab sample ECOM1 agent from bitgn_samples, try running it, observing its interactions via My Runs, improving and claiming a place on the Leaderboard!

Also keep an eye on the BitGN Insights, as we regularly publish new updates!

Roadmap

  • Publish documents
  • Release Sandbox + Sample Agent
  • Freeze API + Test Tasks
  • Competition DateMay 30
  • Publish insights report