Overview

SWE-AGI is an open-source benchmark for evaluating end-to-end, specification-driven construction of production-scale software systems in MoonBit. Tasks require agents to build standards-compliant systems (parsers, interpreters, binary decoders, and SAT solvers) from explicit specifications under a fixed API scaffold.

The benchmark is designed to prioritize reasoning over retrieval. Many target systems are largely absent from the current MoonBit ecosystem, so progress depends on sustained specification understanding, architecture decisions, and long-horizon implementation.

Benchmark Scope

22 tasks · 7 task categories · typical engineering workload of weeks to months per task

As described in the paper, SWE-AGI includes 22 tasks across seven categories:

  • Template and domain-specific languages
  • Data serialization and configuration formats
  • Markup and document formats
  • Programming language front-ends
  • Binary formats and streaming decoders
  • Networking and protocol state machines
  • Automated reasoning and SAT solving

Task Structure

Each task is packaged as a starter repository with a fixed interface:

tasks/<task>/
├── specs/                 # Normative specs and reference documents
├── TASK.md                # Goal, scope, acceptance criteria, constraints
├── *_spec.mbt             # Fixed API declarations + helper contracts
├── *_pub_test.mbt         # Public tests for local iteration
├── *_priv_test.mbt        # Private tests (held out in evaluation checkout)
├── moon.mod.json          # Module manifest and dependencies
└── moon.pkg.json          # Package configuration (imports and build options)

Agents iterate locally with public tests (typically a small visible subset), can add their own spec-grounded checks, and submit final solutions for hidden private-test scoring.
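
For illustration, a frozen API declaration in a *_spec.mbt file might look like the sketch below. This is a hypothetical example: the task, type names, error cases, and parse function are invented and not taken from any actual SWE-AGI scaffold.

// Hypothetical sketch of a *_spec.mbt scaffold (illustrative only).
// The declarations are frozen; the agent supplies the bodies elsewhere.

// One section of a parsed INI document.
pub struct Section {
  name : String
  entries : Array[(String, String)]
}

// A parsed INI document as a flat list of sections.
pub struct IniDoc {
  sections : Array[Section]
}

// Errors surfaced to the public and private test suites.
pub enum IniError {
  UnexpectedChar(Int)       // byte offset of the offending character
  DuplicateSection(String)  // name of the repeated section
}

// Entry point exercised by the tests; `...` marks the body left to the agent.
pub fn parse(input : String) -> Result[IniDoc, IniError] {
  ...
}

Because these declarations are fixed, the public *_pub_test.mbt suite and the held-out private suite exercise the same entry points regardless of how the solution is structured internally.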

Evaluation Protocol

SWE-AGI evaluates correctness and robustness via hidden private tests executed through swe-agi-submit. The current paper does not include scored runtime or memory benchmarks.

  • Agents read specs/ and TASK.md, then implement under a fixed MoonBit API scaffold
  • Agents run local validation (for example, moon test) using public tests and their own tests (see the sketch after this list)
  • Final scoring is based on hidden private tests returned by swe-agi-submit
  • Frontier-model evaluations in the paper use a fixed 12-hour wall-clock budget per task
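
A typical local loop might look like the following sketch. The task directory name is hypothetical; moon check and moon test are standard MoonBit toolchain commands, and the exact swe-agi-submit invocation is not shown.

cd tasks/ini-parser   # hypothetical task directory
moon check            # type-check the solution against the frozen API scaffold
moon test             # run the public *_pub_test.mbt suite plus any agent-added tests
# when the public suite passes, submit via swe-agi-submit for private-test scoring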

Difficulty Tiers

Easy (6)

Smaller parser/decoder systems, typically around 10^3 core LOC. All frontier models in the paper solve 6/6 easy tasks.

  • CSV Parser (98 tests)
  • Git Object Parser (1000 tests)
  • HPACK Decoder (129 tests)
  • INI Parser (98 tests)
  • Protocol Buffers (141 tests)
  • URI Parser (138 tests)

Medium (8)

Multi-module systems with richer state and error semantics. Typical scale is roughly 3×10^3 to 5×10^3 core LOC.

  • Cap'n Proto (111 tests)
  • Pug Template Engine (251 tests)
  • TOML Parser (733 tests)
  • URL Parser (1220 tests)
  • WebAssembly Validator (800 tests)
  • XML Parser (735 tests)
  • YAML Parser (345 tests)
  • ZIP Decoder (1089 tests)

Hard (8)

Large, specification-heavy systems (for example, full language parsers and interpreters) reaching up to about 10^4 core LOC.

  • C99 Parser (117 tests)
  • CDCL SAT Solver (4312 tests)
  • ECMAScript Parser (618 tests)
  • HTML5 Parser (8221 tests)
  • jq Interpreter (218 tests)
  • Lua Interpreter (137 tests)
  • Python Parser (653 tests)
  • R6RS Scheme (1362 tests)

Paper Snapshot

The latest paper reports full evaluations for four frontier models:

  • gpt-5.3-codex: 19/22 tasks (86.4%)
  • gpt-5.2-codex: 17/22 tasks (77.3%)
  • claude-opus-4.6: 15/22 tasks (68.2%)
  • claude-opus-4.5: 10/22 tasks (45.5%)

All four solve the easy tier; separation widens on medium and hard tasks, highlighting the challenge of long-horizon, specification-driven software construction.

Why MoonBit?

MoonBit provides properties that are central to the benchmark design:

  • Declaration-first scaffolding via declare, which freezes task interfaces
  • Integrated toolchain (moon) for build, test, and packaging workflows
  • Static typing and clear diagnostics to support reliable compile-test-refine loops (see the test sketch after this list)
  • Nascent ecosystem for target tasks, reducing superficial retrieval-based shortcuts
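
As a small illustration of that loop, a spec-grounded test an agent might add locally is sketched below, continuing the hypothetical INI example from the Task Structure section. assert_eq is a built-in test assertion, though exact assertion syntax can vary across MoonBit toolchain versions.

// Hypothetical agent-authored test; relies on the invented parse/IniDoc sketch above.
test "empty input parses to a document with no sections" {
  let doc = parse("").unwrap()
  assert_eq(doc.sections.length(), 0)
}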

Citation

If you use SWE-AGI in your research, please cite:

@misc{sweagi2026,
  title={SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents},
  author={Zhirui Zhang and Hongbo Zhang and Haoxiang Fei and Zhiyuan Bao and Yubin Chen and
          Zhengyu Lei and Ziyue Liu and Yixuan Sun and Mingkun Xiao and Zihang Ye and
          Yu Zhang and Hongcheng Zhu and Yuxiang Wen},
  year={2026},
  howpublished={\url{https://github.com/moonbitlang/SWE-AGI/blob/main/paper/main.pdf}},
  note={MoonBit Project}
}