Overview

SWE-AGI is an open-source benchmark for evaluating end-to-end, specification-driven construction of production-scale software systems in MoonBit. Tasks require agents to build standards-compliant systems (parsers, interpreters, binary decoders, and SAT solvers) from explicit specifications under a fixed API scaffold.

The benchmark is designed to prioritize reasoning over retrieval. Many target systems are largely absent from the current MoonBit ecosystem, so progress depends on sustained specification understanding, architecture decisions, and long-horizon implementation.

Benchmark Scope

22 tasks · 7 task categories · typical engineering workload of weeks to months per task

As described in the paper, SWE-AGI includes 22 tasks across seven categories:

  • Template and domain-specific languages
  • Data serialization and configuration formats
  • Markup and document formats
  • Programming language front-ends
  • Binary formats and streaming decoders
  • Networking and protocol state machines
  • Automated reasoning and SAT solving

Task Structure

Each task is packaged as a starter repository with a fixed interface:

tasks/<task>/
├── specs/                 # Normative specs and reference documents
├── TASK.md                # Goal, scope, acceptance criteria, constraints
├── *_spec.mbt             # Fixed API declarations + helper contracts
├── *_pub_test.mbt         # Public tests for local iteration
├── *_priv_test.mbt        # Private tests (held out in evaluation checkout)
├── moon.mod.json          # Module manifest and dependencies
└── moon.pkg.json          # Package configuration (imports and build options)

Agents iterate locally with public tests (typically a small visible subset), can add their own spec-grounded checks, and submit final solutions for hidden private-test scoring.
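
For illustration, a frozen API declaration in a *_spec.mbt file might look like the sketch below. This is a hypothetical example: the task, type names, error cases, and parse function are invented and not taken from any actual SWE-AGI scaffold.

// Hypothetical sketch of a *_spec.mbt scaffold (illustrative only).
// The declarations are frozen; the agent supplies the bodies elsewhere.

// One section of a parsed INI document.
pub struct Section {
  name : String
  entries : Array[(String, String)]
}

// A parsed INI document as a flat list of sections.
pub struct IniDoc {
  sections : Array[Section]
}

// Errors surfaced to the public and private test suites.
pub enum IniError {
  UnexpectedChar(Int)       // byte offset of the offending character
  DuplicateSection(String)  // name of the repeated section
}

// Entry point exercised by the tests; `...` marks the body left to the agent.
pub fn parse(input : String) -> Result[IniDoc, IniError] {
  ...
}

Because these declarations are fixed, the public *_pub_test.mbt suite and the held-out private suite exercise the same entry points regardless of how the solution is structured internally.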

Evaluation Protocol

SWE-AGI evaluates correctness and robustness via hidden private tests executed through swe-agi-submit. The current paper does not include scored runtime or memory benchmarks.

  • Agents read specs/ and TASK.md, then implement under a fixed MoonBit API scaffold
  • Agents run local validation (for example, moon test) using public tests and their own tests (see the sketch after this list)
  • Final scoring is based on hidden private tests returned by swe-agi-submit
  • Frontier-model evaluations in the paper use a fixed 12-hour wall-clock budget per task
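
A typical local loop might look like the following sketch. The task directory name is hypothetical; moon check and moon test are standard MoonBit toolchain commands, and the exact swe-agi-submit invocation is not shown.

cd tasks/ini-parser   # hypothetical task directory
moon check            # type-check the solution against the frozen API scaffold
moon test             # run the public *_pub_test.mbt suite plus any agent-added tests
# when the public suite passes, submit via swe-agi-submit for private-test scoring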

Difficulty Tiers

Easy (6)

Smaller parser/decoder systems, typically around 10^3 core LOC. All frontier models in the paper solve 6/6 easy tasks.

  • CSV Parser (98 tests)
  • Git Object Parser (1000 tests)
  • HPACK Decoder (129 tests)
  • INI Parser (98 tests)
  • Protocol Buffers (141 tests)
  • URI Parser (138 tests)

Medium (8)

Multi-module systems with richer state and error semantics. Typical scale is roughly 3×10^3 to 5×10^3 core LOC.

  • Cap'n Proto (111 tests)
  • Pug Template Engine (251 tests)
  • TOML Parser (733 tests)
  • URL Parser (1220 tests)
  • WebAssembly Validator (800 tests)
  • XML Parser (735 tests)
  • YAML Parser (345 tests)
  • ZIP Decoder (1089 tests)

Hard (8)

Large, specification-heavy systems (for example, full language parsers and interpreters) reaching up to about 10^4 core LOC.

  • C99 Parser (117 tests)
  • CDCL SAT Solver (4312 tests)
  • ECMAScript Parser (618 tests)
  • HTML5 Parser (8221 tests)
  • jq Interpreter (218 tests)
  • Lua Interpreter (137 tests)
  • Python Parser (653 tests)
  • R6RS Scheme (1362 tests)

Paper Snapshot

The latest paper reports full evaluations for four frontier models:

  • gpt-5.3-codex: 19/22 tasks (86.4%)
  • gpt-5.2-codex: 17/22 tasks (77.3%)
  • claude-opus-4.6: 15/22 tasks (68.2%)
  • claude-opus-4.5: 10/22 tasks (45.5%)

All four solve the easy tier; separation widens on medium and hard tasks, highlighting the challenge of long-horizon, specification-driven software construction.

Why MoonBit?

MoonBit provides properties that are central to the benchmark design:

  • Declaration-first scaffolding via declare, which freezes task interfaces
  • Integrated toolchain (moon) for build, test, and packaging workflows
  • Static typing and clear diagnostics to support reliable compile-test-refine loops (see the test sketch after this list)
  • Nascent ecosystem for target tasks, reducing superficial retrieval-based shortcuts
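
As a small illustration of that loop, a spec-grounded test an agent might add locally is sketched below, continuing the hypothetical INI example from the Task Structure section. assert_eq is a built-in test assertion, though exact assertion syntax can vary across MoonBit toolchain versions.

// Hypothetical agent-authored test; relies on the invented parse/IniDoc sketch above.
test "empty input parses to a document with no sections" {
  let doc = parse("").unwrap()
  assert_eq(doc.sections.length(), 0)
}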

Citation

If you use SWE-AGI in your research, please cite:

@misc{sweagi2026,
  title={SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents},
  author={Zhirui Zhang and Hongbo Zhang and Haoxiang Fei and Zhiyuan Bao and Yubin Chen and
          Zhengyu Lei and Ziyue Liu and Yixuan Sun and Mingkun Xiao and Zihang Ye and
          Yu Zhang and Hongcheng Zhu and Yuxiang Wen},
  year={2026},
  howpublished={\url{https://github.com/moonbitlang/SWE-AGI/blob/main/paper/main.pdf}},
  note={MoonBit Project}
}