Overview

Models Evaluated: 11
Tasks: 22
Typical Core LOC per Task: 10^3 to 10^4
Total Tests: 22526
Difficulty Tiers: 3

Model Comparison

Four-model comparison across six dimensions. Tasks Passed is shown out of 22 tasks. Each axis score uses a zero baseline and is computed as value / axis max * 100.
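As a rough illustration of the axis scaling described above, the sketch below applies value / axis max * 100 to the Tasks Passed figures of the four models that ran all 22 tasks. It assumes the axis maximum for that dimension is 22; the page does not state how the maximum is chosen for the other five axes, and the function name `normalize` is hypothetical.

```python
def normalize(value: float, axis_max: float) -> float:
    """Scale a raw value to 0-100 with a zero baseline: value / axis max * 100."""
    if axis_max <= 0:
        raise ValueError("axis_max must be positive")
    return value / axis_max * 100.0


# Tasks Passed for the four models that ran all 22 tasks (from the Model Summary).
tasks_passed = {
    "GPT-5.3 Codex": 19,
    "GPT-5.2 Codex": 17,
    "Claude Opus 4.6": 15,
    "Claude Opus 4.5": 10,
}

# Assumption: the axis maximum for this dimension is the 22-task total.
AXIS_MAX_TASKS = 22

scores = {
    model: round(normalize(passed, AXIS_MAX_TASKS), 1)
    for model, passed in tasks_passed.items()
}

for model, score in scores.items():
    print(f"{model}: {score}")
```

With a zero baseline, a model that passes no tasks sits at the center of the axis and one that passes all 22 sits at the outer edge.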

Behavior Composition by Model

Each model bar is normalized to 100%. Color encodes behavior category; each segment reports a percentage and the raw action count.
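A minimal sketch of how one bar could be normalized to 100% from raw action counts. The category names and counts below are hypothetical placeholders, not data from this page, and the function name `composition` is an assumption for illustration.

```python
def composition(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw action counts into percentages that sum to 100."""
    total = sum(counts.values())
    if total == 0:
        # An empty bar: every category contributes 0%.
        return {category: 0.0 for category in counts}
    return {category: count / total * 100.0 for category, count in counts.items()}


# Hypothetical action counts for a single model (illustrative only).
example_counts = {"read": 50, "edit": 30, "test": 20}

shares = composition(example_counts)
for category, pct in shares.items():
    print(f"{category}: {pct:.1f}% ({example_counts[category]} actions)")
```

Normalizing each bar separately makes the mix of behaviors comparable across models even when their total action counts differ widely.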

Model Summary

Overall performance across all tasks

Model              Agent          Organization  Tasks Passed  Test Case Pass Rate  Total Cost  Total Time
GPT-5.3 Codex      Codex CLI      OpenAI        19/22         95.6%                $213.07     24.8h
GPT-5.2 Codex      Codex CLI      OpenAI        17/22         96.4%                $435.72     108.6h
Claude Opus 4.6    Claude Code    Anthropic     15/22         90.8%                $2055.81    76.4h
Claude Opus 4.5    Claude Code    Anthropic     10/22         81.7%                $507.94     26.8h
Gemini 3 Flash     Gemini CLI     Google        2/6           49.8%                $31.61      1.5h
GLM-4.7            Claude Code    Zhipu AI      2/6           64.2%                $4.86       4.2h
Kimi K2.5          Kimi Code CLI  Moonshot AI   2/6           92.0%                N/A         5.9h
DeepSeek V3.2      Claude Code    DeepSeek      1/6           16.7%                $4.12       20.2h
Claude Sonnet 4.5  Claude Code    Anthropic     0/6           76.1%                $40.67      1.9h
Gemini 3 Pro       Gemini CLI     Google        0/6           16.5%                N/A         1.8h
Qwen3 Max          Claude Code    Alibaba       0/6           13.9%                $368.37     15.5h

Results by Difficulty

Performance breakdown by task difficulty tier

Easy Tier

Model              Agent          Tasks Passed  Test Case Pass Rate  Avg Time  Avg LOC  Cost
Claude Opus 4.5    Claude Code    6/6           100.0%               0.39h     1092     $56.69
Claude Opus 4.6    Claude Code    6/6           100.0%               0.45h     1781     $48.61
Claude Sonnet 4.5  Claude Code    0/6           76.1%                0.32h     930      $40.67
DeepSeek V3.2      Claude Code    1/6           16.7%                3.4h      1070     $4.12
Gemini 3 Flash     Gemini CLI     2/6           49.8%                0.25h     558      $31.61
Gemini 3 Pro       Gemini CLI     0/6           16.5%                0.30h     710      N/A
GLM-4.7            Claude Code    2/6           64.2%                0.70h     904      $4.86
GPT-5.2 Codex      Codex CLI      6/6           100.0%               0.81h     1081     $33.51
GPT-5.3 Codex      Codex CLI      6/6           100.0%               0.28h     1305     $15.00
Kimi K2.5          Kimi Code CLI  2/6           92.0%                0.99h     1163     N/A
Qwen3 Max          Claude Code    0/6           13.9%                2.6h      850      $368.37

Medium Tier

Model              Agent          Tasks Passed  Test Case Pass Rate  Avg Time  Avg LOC  Cost
Claude Opus 4.5    Claude Code    3/8           82.6%                1.3h      3304     $208.43
Claude Opus 4.6    Claude Code    5/8           93.6%                3.5h      4867     $1183.94
GPT-5.2 Codex      Codex CLI      7/8           98.9%                5.1h      4702     $287.17
GPT-5.3 Codex      Codex CLI      8/8           100.0%               1.2h      2575     $114.14

Hard Tier

Model              Agent          Tasks Passed  Test Case Pass Rate  Avg Time  Avg LOC  Cost
Claude Opus 4.5    Claude Code    1/8           67.0%                1.7h      6603     $242.82
Claude Opus 4.6    Claude Code    4/8           81.2%                5.7h      10103    $823.26
GPT-5.2 Codex      Codex CLI      4/8           91.2%                7.8h      9034     $115.04
GPT-5.3 Codex      Codex CLI      5/8           87.9%                1.7h      6255     $83.94