Submission Requirements

To submit results to the SWE-AGI leaderboard, you must use the official Docker evaluation setup with a public/private test split. This keeps submissions reproducible and comparable.

1. Use the Official Docker Workflow

Use the docker/ infrastructure so agents only see public tests in client_data, while private tests remain in server_data.

2. Use a Supported Repository Layout

Use one of the two supported layouts: run directly in this monorepo (where tasks/ and docker/ already exist), or create an isolated eval/ bundle that includes both tasks/* and docker/.

3. Include the Standard Artifacts

Each submitted task directory should include log.jsonl, log.yaml, and run-metrics.json.

4. Submit Through a GitHub Pull Request

Submit results via a pull request to SWE-AGI-Eval.

Official Evaluation Workflow

The steps below follow the process documented in tasks/EVALUATION.md and docker/README.md.

1. Choose a workspace layout

# Option A: run directly in this monorepo
cd <path-to-SWE-AGI>/docker

# Option B: create an isolated bundle (from repo root)
mkdir -p eval/<date>-<runner-name>
cp -R tasks/* eval/<date>-<runner-name>/
cp -R docker eval/<date>-<runner-name>/docker
cd eval/<date>-<runner-name>/docker

2. Build image

docker build --platform=linux/amd64 -t swe-agi:latest .
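To confirm the build succeeded, you can list the image; output details such as image ID and size will vary:

docker images swe-agi:latest
# expect a single entry tagged swe-agi:latest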

3. Prepare split data

python3 setup.py
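After setup.py completes, the two split directories described above should exist in the working directory; their exact contents depend on the benchmark tasks:

ls
# expect, among the Docker files:
#   client_data/   public tests only (visible to agents)
#   server_data/   full suite, including private tests (server-side only)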

4. Start containers

./start.sh
# or: docker-compose up -d
# or: docker compose up -d

5. Find client container and authenticate once

docker ps --filter name=swe-agi-client
# optional:
# docker compose ps
docker exec -it swe-agi-client-<timestamp> bash
# inside container, login as needed:
#   codex
#   claude
#   gemini
#   kimi
exit

6. Run evaluations

docker exec -d swe-agi-client-<timestamp> swe-agi-run <spec> <runner>
# example:
docker exec -d swe-agi-client-20260206-120000 swe-agi-run toml gpt-5.3-codex
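Because the run is detached (-d), the command returns immediately. If the runner writes log.jsonl incrementally, one way to watch progress from the host is to follow it under client_data (paths as in step 7; substitute your spec name):

tail -f client_data/<spec>/log.jsonl
# example:
tail -f client_data/toml/log.jsonl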

7. Check run artifacts

ls client_data/<spec>/
# expected:
#   log.jsonl
#   log.yaml
#   run-metrics.json
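As an optional sanity check before submitting, you can confirm that the JSON-based artifacts parse cleanly. This is a minimal sketch, not part of the official tooling; it assumes log.jsonl is one JSON object per line:

python3 -m json.tool client_data/<spec>/run-metrics.json > /dev/null && echo "run-metrics.json parses"
python3 -c 'import json,sys; [json.loads(l) for l in sys.stdin if l.strip()]' \
  < client_data/<spec>/log.jsonl && echo "log.jsonl parses"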

Submission Process (Pull Request)

Copy completed results from your run workspace into SWE-AGI-Eval, following the layout below, and open a pull request.

SWE-AGI-Eval/
└── <model-name>/
    ├── toml/
    │   ├── log.jsonl
    │   ├── log.yaml
    │   ├── run-metrics.json
    │   └── ...copied task files...
    ├── yaml/
    │   └── ...
    └── ...
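A minimal copy sketch, assuming <path-to-SWE-AGI-Eval> is a local checkout of SWE-AGI-Eval and that <model-name> and <spec> follow the layout above; adjust paths and names to your run:

# from your run's docker/ working directory
mkdir -p <path-to-SWE-AGI-Eval>/<model-name>/<spec>
cp -R client_data/<spec>/. <path-to-SWE-AGI-Eval>/<model-name>/<spec>/
# repeat for each completed spec, then commit and open the pull request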

Recommended pre-PR checklist (a small check script follows the list):

  • log.jsonl, log.yaml, and run-metrics.json exist for each submitted task
  • Runs complete without execution failures in log.jsonl
  • Task directories and names match benchmark task names
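A small check for the first item, assuming the submission is laid out as in the tree above (hypothetical helper, not part of the official tooling):

# run from SWE-AGI-Eval/<model-name>/
for task in */; do
  for f in log.jsonl log.yaml run-metrics.json; do
    [ -f "$task$f" ] || echo "missing: $task$f"
  done
done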

Validation Notes

During runs, the agent submits with swe-agi-submit. The server validates against the full suite (public + private tests) and returns pass/fail summaries.

Use local public tests for iteration, but treat server-validated results as the benchmark record.

Stopping and Resetting

./stop.sh
# or: docker-compose down
# or: docker compose down

# reset datasets (destructive):
rm -rf client_data server_data
python3 setup.py

Verification

We verify submissions by:

  • Re-running evaluation on a subset of tasks
  • Checking that implementations pass hidden private tests
  • Reviewing run artifacts and metrics for consistency

Submissions that cannot be reproduced or show signs of test contamination will not be included on the leaderboard.

Questions?

Open an issue in SWE-AGI or SWE-AGI-Eval.