Overview

Models Evaluated: 11
Tasks: 22
Typical Core LOC per Task: 10³ to 10⁴
Total Tests: 22,526
Difficulty Tiers: 3

Model Comparison

Four-model comparison across six dimensions, covering the models evaluated on all 22 tasks. Tasks Passed is shown out of 22 tasks. Each axis uses a zero baseline, and scores are computed as value / axis max × 100.
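The per-axis scaling is a single ratio. Below is a minimal sketch of it, assuming each axis has a fixed maximum (22 for Tasks Passed, per the description above). The function name `axis_score` is hypothetical.

```python
def axis_score(value: float, axis_max: float) -> float:
    """Scale a raw metric onto a 0-100 radar axis with a zero baseline."""
    if axis_max <= 0:
        raise ValueError("axis_max must be positive")
    # Zero baseline: a value of 0 maps to 0 and axis_max maps to 100.
    return value / axis_max * 100


# Tasks Passed for GPT-5.3 Codex (19 of 22) on an axis whose maximum is 22.
print(round(axis_score(19, 22), 1))  # 86.4
```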

Behavior Composition by Model

Each model's bar is normalized to 100%. Color encodes the behavior category, and each segment carries the category's percentage and raw action count.
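Normalizing a bar to 100% divides each category's raw action count by the model's total action count. Here is a minimal sketch. The category names and counts are hypothetical placeholders, because the page does not list the behavior categories.

```python
def behavior_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw action counts into percentages that sum to 100."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no actions recorded")
    # Multiply before dividing so round numbers stay exact in floating point.
    return {category: 100 * count / total for category, count in counts.items()}


# Hypothetical categories and counts, for illustration only.
shares = behavior_shares({"explore": 30, "edit": 50, "test": 20})
print(shares)  # {'explore': 30.0, 'edit': 50.0, 'test': 20.0}
```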

Model Summary

Overall performance across all tasks. Models shown with a /6 denominator were evaluated on the six Easy-tier tasks only.

Model              Organization  Tasks Passed  Pass Rate  Total Cost  Total Time
GPT-5.3 Codex      OpenAI        19/22         95.6%      $213.07     24.8h
GPT-5.2 Codex      OpenAI        17/22         96.4%      $435.72     108.6h
Claude Opus 4.6    Anthropic     15/22         90.8%      $2055.81    76.4h
Claude Opus 4.5    Anthropic     10/22         81.7%      $507.94     26.8h
Gemini 3 Flash     Google        2/6           49.8%      $31.61      1.5h
GLM-4.7            Zhipu AI      2/6           64.2%      $4.86       4.2h
Kimi K2.5          Moonshot AI   2/6           92.0%      N/A         5.9h
DeepSeek V3.2      DeepSeek      1/6           16.7%      $4.12       20.2h
Claude Sonnet 4.5  Anthropic     0/6           76.1%      $40.67      1.9h
Gemini 3 Pro       Google        0/6           16.5%      N/A         1.8h
Qwen3 Max          Alibaba       0/6           13.9%      $368.37     15.5h

Results by Difficulty

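Performance breakdown by task difficulty tier. The Easy tier contains 6 tasks, and the Medium and Hard tiers contain 8 each. Only the four models evaluated on all 22 tasks appear in the Medium and Hard tables.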

Easy Tier

Model              Tasks Passed  Pass Rate  Avg Time  Avg LOC  Cost
Claude Opus 4.5    6/6           100.0%     0.39h     1092     $56.69
Claude Opus 4.6    6/6           100.0%     0.45h     1781     $48.61
Claude Sonnet 4.5  0/6           76.1%      0.32h     930      $40.67
DeepSeek V3.2      1/6           16.7%      3.4h      1070     $4.12
Gemini 3 Flash     2/6           49.8%      0.25h     558      $31.61
Gemini 3 Pro       0/6           16.5%      0.30h     710      N/A
GLM-4.7            2/6           64.2%      0.70h     904      $4.86
GPT-5.2 Codex      6/6           100.0%     0.81h     1081     $33.51
GPT-5.3 Codex      6/6           100.0%     0.28h     1305     $15.00
Kimi K2.5          2/6           92.0%      0.99h     1163     N/A
Qwen3 Max          0/6           13.9%      2.6h      850      $368.37

Medium Tier

Model              Tasks Passed  Pass Rate  Avg Time  Avg LOC  Cost
Claude Opus 4.5    3/8           82.6%      1.3h      3304     $208.43
Claude Opus 4.6    5/8           93.6%      3.5h      4867     $1183.94
GPT-5.2 Codex      7/8           98.9%      5.1h      4702     $287.17
GPT-5.3 Codex      8/8           100.0%     1.2h      2575     $114.14

Hard Tier

Model              Tasks Passed  Pass Rate  Avg Time  Avg LOC  Cost
Claude Opus 4.5    1/8           67.0%      1.7h      6603     $242.82
Claude Opus 4.6    4/8           81.2%      5.7h      10103    $823.26
GPT-5.2 Codex      4/8           91.2%      7.8h      9034     $115.04
GPT-5.3 Codex      5/8           87.9%      1.7h      6255     $83.94