BitGN Arena
← Back to Insights

Codex-on-Rails: Code-Mediated Execution (PAC1)

April 21, 2026

Run Points Created
codex-on-rails (view run) 87.0/104 1 wk ago

Codex-on-Rails tied for first place in the most demanding part of the BitGN PAC1 blind run, alongside Operation Pangolin.

What is under the hood?

codex-on-rails is not a REPL-style analyst like Pangolin. It is closer to a hardened Codex worker: one codex exec --full-auto run per task, connected to the BitGN runtime through a custom MCP server.

The core of this agent is written in Python. It was built for the challenge using Codex CLI 0.107.0 and gpt-5.4 with high reasoning effort.

The architecture is opinionated. While the BitGN PAC1 runtime exposes multiple agentic tools, codex-on-rails exposes them behind a custom vault-facing MCP server. The model sees a small set of native tools such as vault_read, vault_write, vault_delete, vault_search, vault_find, and vault_tree.

The model does not need to generate code just to access the BitGN runtime, unlike Pangolin. Instead, it accesses the vault directly through a narrow MCP tool interface with strong rails. It can still write local code, shell snippets, or draft files when useful, but runtime access itself happens through the vault_* tools.

The rails around those tools are a large part of the design (hence the codex-on-rails):

  • file reads are wrapped in <vault-file> tags with type, trust, and format metadata
  • the vault tree, root AGENTS.md, and task context are auto-discovered before the run starts
  • grounding references are tracked by the MCP server itself rather than trusted from model output
  • writes are syntax-validated before they hit the runtime
  • the final response must conform to a strict TaskResult schema with explicit outcome codes

A particularly strong part of this design is how much judgment is pushed into the rails:

  • strict trust boundaries between root policy files and untrusted workspace content
  • explicit rules for clarification vs. denial vs. unsupported outcomes
  • canonical field-resolution order for identity, routing, and metadata
  • an optional multi-step protocol for local drafting and verification before committing writes
  • a compaction prompt that preserves trust tags, vault state, and plan progress across long runs

What the blind-run artifacts show

The blind-run artifacts also show the flipside of that design:

  • over-denying or over-clarifying on messy but legitimate requests
  • reliance on prompt discipline and policy rails rather than a programmable persistent memory loop
  • sensitivity to the exact shape of workspace docs, trust markers, and validation rules

1. Finance lookup tasks

These were tasks like:

  • “How much did X charge me for line item Y N days ago?”
  • “How much money did we earn from service line Z since the beginning of month M?”

This family accounts for 6 of the 17 misses.

The agent often found the right records. In representative failures, it read the correct bill or invoice, saw the relevant line item, and then still returned the wrong result:

  • sometimes OUTCOME_NONE_CLARIFICATION instead of a number
  • sometimes a wrong total like 390 instead of 1560
  • sometimes even 0 after reading the relevant invoices

So this was not basic retrieval failure. It was more like the last mile of numeric reasoning breaking down: date scoping, aggregation boundaries, or task interpretation after the evidence was already in view.

2. Inbox tasks that were logically solved, but failed on YAML

This was the most concrete avoidable cluster. 7 of the 17 misses came from inbox workflows that were mostly solved correctly, but the generated outbox email file had invalid YAML frontmatter.

The repeated pattern was the same:

  • the agent resolved the recipient correctly
  • found the right invoices or attachments
  • wrote the outbox draft
  • deleted the inbox item
  • answered OUTCOME_OK

And then grading failed because of invalid frontmatter in the generated email record.

So the benchmark miss was not “the agent did not understand the task”. It was serialization drift in a structured artifact.

3. Security tasks with the wrong terminal state

4 of the 17 misses were inbox tasks where the correct grading outcome was OUTCOME_DENIED_SECURITY, but the agent returned OUTCOME_NONE_CLARIFICATION.

In substance, the agent often understood that the request should not go through. It recognized missing authorization or inappropriate sharing. So the trust reasoning was often close.

But the benchmark wanted a specific terminal state: deny for security reasons, not “I need clarification”.

That is a very PAC1-shaped failure. The agent is close in intent, but the benchmark measures the exact contract, not approximate behavior.

What to copy from this architecture

The solution claimed first place and matched Pangolin with 87.0 points while beating the rest of the competition.

To reproduce this architecture in your agent:

  • use a strong frontier model through a coding agent runtime
  • expose the benchmark through a custom MCP bridge with domain-specific tools
  • annotate reads with trust and file-type metadata
  • track grounding refs outside the model
  • validate writes before they reach the runtime
  • force structured outputs with explicit outcome codes
  • evolve the prompt around concrete benchmark failure modes

Congratulations to Igor Inozemtsev for scoring 1st place with 87.0 points (shared with Operation Pangolin) in the Accuracy Leaderboard of the BitGN PAC1 blind run.

The Accuracy leaderboard is the most challenging leaderboard because it targets absolute precision: participants must blindly pick only one submission out of their runs and hope that it scored high enough.