Open Models Research: Exoskeleton on ECOM1

June 22, 2026

AI Engineer: Ilyas Salikhov · GitHub · LinkedIn

Challenge Source Code ↗ Team Deep Dive

This is follow-up research to the previous insight: 2026-06-15-ecom1-exoskeleton-insight.

Ilyas Salikhov benchmarked his Exoskeleton architecture against different open-weight models inside the BitGN Agentic Commerce world (the ECOM1-PROD verification space).

On this benchmark, open-weight models cannot yet beat even the smaller OpenAI mini/nano models, but they are starting to close the gap in both speed and capability.

Leaderboard of average score by family

Check out the LLM Evals page for the history of our reports on the application of LLMs in business workloads.

One important caveat about this research is that the Exoskeleton architecture was specifically designed for the OpenAI GPT-5.4 mini and nano flavors. Different models have different failure modes (see Ilyas’s research table below). Swapping in open-weight models without any harness adaptations therefore likely understates their potential capabilities.

Legend: ● = recurring/strong error class; ○ = occasional error class. The shared ceiling (dispatch, archive-fraud, TSV export) is excluded because it is the same for everyone.

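| Error class | GPT | Kimi | GLM | MiniMax | Nemotron | Mistral | Qwen | Gemma | DeepSeek | Llama |
|---|---|---|---|---|---|---|---|---|---|---|
| Format drift (`nein`, `FALSE(2)`, placeholders) | | ○ | | | ● | ○ | ○ | ○ | ○ | ○ |
| Hallucinated action (claims a mutation that isn’t there) | | ○ | | | ○ | ● | | | ○ | ● |
| Security under-denial | | ○ | ○ | ○ | ○ | ○ | ● | ○ | | ● |
| Over-conservatism (`unsupported` where action is needed) | | | ● | ○ | | ● | | | | |
| Poor step economy / doesn’t stop | | ○ | ● | ● | ● | | ● | ● | ● | |
| Reference discipline (right answer, wrong refs) | ○ | ○ | ○ | ● | ● | ● | ● | ○ | ● | ● |
| Provider incompatibility | | ○ | ○ (5.2 failed) | | | | ● | ● | ○ | ○ |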
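
To make the table easier to compare across families, here is a minimal Python sketch that transcribes it and counts recurring and occasional error classes per model. The names `MODELS`, `MATRIX`, and `tally` are made up for this illustration and are not part of the research. The row labels are shortened, the "(5.2 failed)" note on GLM is dropped, and the counts are only a reading aid: they ignore severity and are not a score used in the report.

```python
# Error-class matrix transcribed from the table above.
# "●" = recurring/strong, "○" = occasional, "" = not marked in the table.
MODELS = ["GPT", "Kimi", "GLM", "MiniMax", "Nemotron",
          "Mistral", "Qwen", "Gemma", "DeepSeek", "Llama"]

MATRIX = {
    "Format drift":             ["", "○", "", "", "●", "○", "○", "○", "○", "○"],
    "Hallucinated action":      ["", "○", "", "", "○", "●", "", "", "○", "●"],
    "Security under-denial":    ["", "○", "○", "○", "○", "○", "●", "○", "", "●"],
    "Over-conservatism":        ["", "", "●", "○", "", "●", "", "", "", ""],
    "Poor step economy":        ["", "○", "●", "●", "●", "", "●", "●", "●", ""],
    "Reference discipline":     ["○", "○", "○", "●", "●", "●", "●", "○", "●", "●"],
    "Provider incompatibility": ["", "○", "○", "", "", "", "●", "●", "○", "○"],
}


def tally(matrix, models):
    """Return {model: (recurring_count, occasional_count)}."""
    result = {}
    for col, model in enumerate(models):
        marks = [row[col] for row in matrix.values()]
        result[model] = (marks.count("●"), marks.count("○"))
    return result


if __name__ == "__main__":
    for model, (recurring, occasional) in tally(MATRIX, MODELS).items():
        print(f"{model:<9} recurring={recurring} occasional={occasional}")
```

Read this way, GPT shows a single occasional class (reference discipline), which is consistent with the harness having been designed around it, while Kimi shows occasional issues in six classes but no recurring ones.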

Here is another chart from the research, this time comparing quality, cost, and speed on the same surface.

X axis: cost per run in dollars (log scale); Y axis: average score; point colour: speed.

Check out the linked Deep Dive for the full report. Brief summary of the insights:

(1) Kimi is better than many expected.
(2) Cheap tokens do not translate to cheap task runs.
(3) While it is possible to build cheap and accurate architectures, getting fast and accurate at the same time is still hard.
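
Insight (2) can be illustrated with a small back-of-the-envelope sketch. It assumes that cost per run is roughly the token price multiplied by tokens per step and by steps per run, so a model with poor step economy can spend more per task despite a lower token price. The `RunProfile` class, the model names, and every number below are hypothetical and chosen only to show the arithmetic; they are not measurements from the research.

```python
from dataclasses import dataclass


@dataclass
class RunProfile:
    """Hypothetical per-run usage profile; every number is illustrative."""
    name: str
    usd_per_million_tokens: float  # blended input/output price (assumed)
    tokens_per_step: int           # average tokens per agent step (assumed)
    steps_per_run: int             # steps until the agent stops (assumed)

    def cost_per_run(self) -> float:
        """Total tokens for one run multiplied by the per-token price."""
        total_tokens = self.tokens_per_step * self.steps_per_run
        return total_tokens * self.usd_per_million_tokens / 1_000_000


# Pricier tokens, but the agent stops after a few steps.
frugal = RunProfile("model-a", usd_per_million_tokens=1.00,
                    tokens_per_step=8_000, steps_per_run=10)

# Tokens are 4x cheaper, but the agent takes 5x as many steps.
wasteful = RunProfile("model-b", usd_per_million_tokens=0.25,
                      tokens_per_step=8_000, steps_per_run=50)

if __name__ == "__main__":
    for profile in (frugal, wasteful):
        print(f"{profile.name}: ${profile.cost_per_run():.2f} per run")
```

Under these made-up numbers, the model with tokens at a quarter of the price still costs more per run (about $0.10 versus $0.08), because it needs five times as many steps to finish.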
