Submit Results
Run the official evaluation workflow and submit results to the SWE-AGI leaderboard
Submission Requirements
To submit results to the SWE-AGI leaderboard, you must use the official Docker evaluation setup with a public/private test split. This keeps submissions reproducible and comparable.
Use Official Docker Workflow
Use the docker/ infrastructure so agents only see public tests in client_data, while private tests remain in server_data.
Use Repository Layout Correctly
Use one of two supported layouts: run directly in this monorepo (where tasks/ and docker/ already exist), or create an isolated eval/ bundle that includes both tasks/* and docker/.
Include Standard Artifacts
Each submitted task directory should include log.jsonl, log.yaml, and run-metrics.json.
Submit Through GitHub PR
Submit results by pull request to SWE-AGI-Eval.
Official Evaluation Workflow
Follow the process below, taken from tasks/EVALUATION.md and docker/README.md.
1. Choose a workspace layout
# Option A: run directly in this monorepo
cd <path-to-SWE-AGI>/docker
# Option B: create an isolated bundle (from repo root)
mkdir -p eval/<date>-<runner-name>
cp -R tasks/* eval/<date>-<runner-name>/
cp -R docker eval/<date>-<runner-name>/docker
cd eval/<date>-<runner-name>/docker
2. Build image
docker build --platform=linux/amd64 -t swe-agi:latest .
3. Prepare split data
python3 setup.py
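To sanity-check the split, you can list both directories after setup.py finishes; the comments below restate the intended split (public vs. private tests) described above, not a verified layout.
# Optional sketch: confirm the public/private split produced by setup.py.
ls client_data    # public tests only -- what the agent is allowed to see
ls server_data    # private tests -- held back for server-side validation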
4. Start containers
./start.sh
# or: docker-compose up -d
# or: docker compose up -d
5. Find client container and authenticate once
docker ps --filter name=swe-agi-client
# optional:
# docker compose ps
docker exec -it swe-agi-client-<timestamp> bash
# inside the container, log in as needed:
# codex
# claude
# gemini
# kimi
exit
6. Run evaluations
docker exec -d swe-agi-client-<timestamp> swe-agi-run <spec> <runner>
# example:
docker exec -d swe-agi-client-20260206-120000 swe-agi-run toml gpt-5.3-codex
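Runs start detached and return immediately. One way to follow progress from the host is to watch the log file; this is only a sketch and assumes log.jsonl is appended to while the run is in flight, which the docs above do not guarantee.
# Optional: follow a detached run from the host (assumes log.jsonl is
# written incrementally; adjust <spec> to the task you started).
tail -f client_data/<spec>/log.jsonl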
7. Check run artifacts
ls client_data/<spec>/
# expected:
# log.jsonl
# log.yaml
# run-metrics.json
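Before copying anything into the submission repo, a quick local inspection can catch truncated or empty artifacts; run-metrics.json is assumed here only to be valid JSON, with no assumptions about its schema.
# Optional: pretty-print the metrics and peek at the last few log events.
python3 -m json.tool client_data/<spec>/run-metrics.json
tail -n 5 client_data/<spec>/log.jsonl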
Submission Process (Pull Request)
Copy completed results from your run workspace into SWE-AGI-Eval and open a pull request.
SWE-AGI-Eval/
└── <model-name>/
├── toml/
│ ├── log.jsonl
│ ├── log.yaml
│ ├── run-metrics.json
│ └── ...copied task files...
├── yaml/
│ └── ...
└── ...
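As a sketch, copying a single completed task into that layout might look like the following; <model-name> is a placeholder, and the toml paths assume the example run from step 6.
# Sketch: copy one task's artifacts into the submission layout shown above.
mkdir -p SWE-AGI-Eval/<model-name>/toml
cp -R client_data/toml/. SWE-AGI-Eval/<model-name>/toml/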
Recommended pre-PR checklist:
- log.jsonl, log.yaml, and run-metrics.json exist for each submitted task
- Runs complete without execution failures in log.jsonl
- Task directories and names match benchmark task names
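A minimal shell sketch for the first checklist item, assuming your results sit under SWE-AGI-Eval/<model-name>/ as in the layout above:
# Sketch: verify the three required artifacts exist in every submitted
# task directory (paths follow the layout above; adjust as needed).
for task_dir in SWE-AGI-Eval/<model-name>/*/; do
  for f in log.jsonl log.yaml run-metrics.json; do
    [ -f "$task_dir$f" ] || echo "missing: $task_dir$f"
  done
done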
Validation Notes
During runs, the agent submits with swe-agi-submit. The server validates against the full suite (public + private tests) and returns pass/fail summaries.
Use local public tests for iteration, but treat server-validated results as the benchmark record.
Stopping and Resetting
./stop.sh
# or: docker-compose down
# or: docker compose down
# reset datasets (destructive):
rm -rf client_data server_data
python3 setup.py
Verification
We verify submissions by:
- Re-running evaluation on a subset of tasks
- Checking that implementations pass hidden private tests
- Reviewing run artifacts and metrics for consistency
Submissions that cannot be reproduced or show signs of test contamination will not be included on the leaderboard.
Questions?
Open an issue in SWE-AGI or SWE-AGI-Eval.