Quality benchmarking beta

week 2026-W19 · 5584 runs · generated 2026-05-11T12:41:46+00:00 · embeddings off

Alerts

| Severity | Kind | Model | Provider | Message |
|---|---|---|---|---|
| warning | drift | glm-5.1 | crofai | glm-5.1 on crofai is 10.0% behind zai (z=-1.01) |
| critical | drift | minimax-m2 | synthetic | minimax-m2 on synthetic is 70.0% behind minimax (z=-5.50) |
| warning | drift | deepseek-v3.2 | crofai | deepseek-v3.2 on crofai is 23.3% behind synthetic (z=-1.86) |
| warning | drift | deepseek-v3.2 | crofai | deepseek-v3.2 on crofai is 22.2% behind synthetic (z=-1.96) |
| warning | refusal_spike | deepseek-v3.2 | crofai | refusal rate +16.7% vs synthetic |
| critical | drift | minimax-m2 | synthetic | minimax-m2 on synthetic is 55.6% behind minimax (z=-5.26) |
| critical | drift | minimax-m2.1 | synthetic | minimax-m2.1 on synthetic is 83.3% behind minimax (z=-6.55) |
| critical | drift | minimax-m2.1 | synthetic | minimax-m2.1 on synthetic is 58.3% behind minimax (z=-5.44) |
| warning | drift | minimax-m2.5 | crofai | minimax-m2.5 on crofai is 15.0% behind minimax (z=-1.42) |

Developer ranking

| # | Model | Provider | Score | Pass | Refusal | Latency | Real tok/s | Runs |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4o | copilot | 0.901 | 75.0% | 0.0% | 5971 ms | 79 (n=16) | 16 |
| 2 | glm-5.1 (hf:zai-org/GLM-5.1) | synthetic | 0.898 | 83.3% | 0.0% | 18047 ms | 127 (n=32) | 36 |
| 3 | kimi-k2.5 | alibaba | 0.898 | 83.3% | 0.0% | 23477 ms | 34 (n=32) | 36 |
| 4 | qwen3-coder-480b (hf:Qwen/Qwen3-Coder-480B-A35B-Instruct) | synthetic | 0.898 | 83.3% | 0.0% | 15965 ms | 45 (n=32) | 36 |
| 5 | gpt-oss-120b (hf:openai/gpt-oss-120b) | synthetic | 0.898 | 83.3% | 0.0% | 6736 ms | 134 (n=32) | 36 |
| 6 | deepseek-r1 (deepseek-reasoner) | deepseek | 0.886 | 80.6% | 0.0% | 2701 ms | 2268 (n=32) | 36 |
| 7 | gpt-5.2 | openai | 0.884 | 81.2% | 0.0% | 7798 ms | 45 (n=16) | 16 |
| 8 | glm-5 | alibaba | 0.884 | 81.2% | 0.0% | 77372 ms | 46 (n=16) | 16 |
| 9 | gpt-5.3-codex | openai | 0.884 | 81.2% | 0.0% | 8194 ms | 46 (n=16) | 16 |
| 10 | gemini-3.1-pro-preview | copilot | 0.884 | 81.2% | 0.0% | 12277 ms | 26 (n=16) | 16 |
| 11 | qwen3-max-2026-01-23 | alibaba | 0.884 | 81.2% | 0.0% | 10918 ms | 26 (n=16) | 16 |
| 12 | gpt-4.1 | copilot | 0.884 | 81.2% | 0.0% | 4708 ms | 69 (n=16) | 16 |
| 13 | gpt-5.4-mini | openai | 0.884 | 81.2% | 0.0% | 3616 ms | 95 (n=16) | 16 |
| 14 | gpt-5.2 | copilot | 0.884 | 81.2% | 0.0% | 3994 ms | 74 (n=16) | 16 |
| 15 | gpt-5.4-mini-fast | openai | 0.884 | 81.2% | 0.0% | 3596 ms | 86 (n=16) | 16 |
| 16 | claude-sonnet-4.5 (claude-sonnet-4-5) | claude | 0.878 | 75.0% | 0.0% | 6871 ms | 67 (n=16) | 16 |
| 17 | claude-sonnet-4.5 | copilot | 0.878 | 75.0% | 0.0% | 7456 ms | 66 (n=16) | 16 |
| 18 | hf:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | synthetic | 0.876 | 75.0% | 0.0% | 6989 ms | 122 (n=16) | 16 |
| 19 | hf:meta-llama/Llama-3.3-70B-Instruct | synthetic | 0.871 | 75.0% | 0.0% | 7006 ms | 77 (n=16) | 16 |
| 20 | deepseek-v3.2 (deepseek-chat) | deepseek | 0.866 | 75.0% | 0.0% | 2122 ms | 1098 (n=32) | 36 |
| 21 | glm-5.1 (glm-5.1-precision) | crofai | 0.863 | 79.2% | 0.0% | 34600 ms | 56 (n=63) | 72 |
| 22 | gpt-5.4-fast | openai | 0.848 | 75.0% | 0.0% | 6089 ms | 40 (n=16) | 16 |
| 23 | claude-sonnet-4.6 (claude-sonnet-4-6) | claude | 0.838 | 75.0% | 0.0% | 8767 ms | 67 (n=32) | 36 |
| 24 | qwen3.5-plus | alibaba | 0.830 | 75.0% | 0.0% | 82819 ms | 54 (n=15) | 16 |
| 25 | grok-code-fast-1 | copilot | 0.830 | 75.0% | 0.0% | 29713 ms | 10 (n=16) | 16 |
| 26 | openai/gpt-oss-120b:free | openrouter | 0.824 | 68.8% | 0.0% | 963 ms | 547 (n=16) | 16 |
| 27 | gemini-2.5-pro | copilot | 0.822 | 75.0% | 0.0% | 10567 ms | 48 (n=32) | 36 |
| 28 | glm-5.1 | zai | 0.819 | 72.2% | 2.8% | 37501 ms | 57 (n=32) | 36 |
| 29 | deepseek-v4-pro | deepseek | 0.818 | 75.0% | 8.3% | 5806 ms | 3026 (n=32) | 36 |
| 30 | deepseek-v4-flash | deepseek | 0.810 | 72.2% | 8.3% | 3766 ms | 3133 (n=32) | 36 |
| 31 | gemini-3-flash (gemini-3-flash-preview) | copilot | 0.806 | 72.2% | 2.8% | 7486 ms | 44 (n=32) | 36 |
| 32 | claude-sonnet-4.6 | copilot | 0.806 | 72.2% | 0.0% | 3608 ms | 63 (n=32) | 36 |
| 33 | gpt-5.4 | openai | 0.800 | 75.0% | 0.0% | 8211 ms | 38 (n=16) | 16 |
| 34 | claude-opus-4.7 (claude-opus-4-7) | claude | 0.799 | 69.4% | 0.0% | 5421 ms | 80 (n=32) | 36 |
| 35 | kimi-k2.6 (kimi-k2.6-precision) | crofai | 0.798 | 73.6% | 2.8% | 25284 ms | 57 (n=62) | 72 |
| 36 | claude-haiku-4.5 | copilot | 0.790 | 69.4% | 0.0% | 4068 ms | 131 (n=32) | 36 |
| 37 | glm-4.7 | zai | 0.775 | 75.0% | 6.2% | 75511 ms | 41 (n=16) | 16 |
| 38 | glm-5-turbo | zai | 0.775 | 75.0% | 6.2% | 34618 ms | 63 (n=16) | 16 |
| 39 | MiniMax-M2.5 | alibaba | 0.775 | 68.8% | 0.0% | 53205 ms | 59 (n=16) | 16 |
| 40 | qwen3-coder-next | alibaba | 0.775 | 68.8% | 0.0% | 3448 ms | 127 (n=16) | 16 |
| 41 | qwen3-coder-plus | alibaba | 0.775 | 68.8% | 0.0% | 4059 ms | 60 (n=16) | 16 |
| 42 | hf:zai-org/GLM-5 | synthetic | 0.775 | 75.0% | 6.2% | 33968 ms | 118 (n=16) | 16 |
| 43 | claude-haiku-4.5 (claude-haiku-4-5-20251001) | claude | 0.774 | 66.7% | 0.0% | 3539 ms | 146 (n=32) | 36 |
| 44 | claude-opus-4.5 (claude-opus-4-5) | claude | 0.774 | 66.7% | 0.0% | 5790 ms | 72 (n=32) | 36 |
| 45 | gpt-5.5 | openai | 0.764 | 68.8% | 0.0% | 17303 ms | 56 (n=16) | 16 |
| 46 | deepseek-v3.2 (hf:deepseek-ai/DeepSeek-V3.2) | synthetic | 0.762 | 75.0% | 2.8% | 26285 ms | 76 (n=32) | 36 |
| 47 | qwen3.5-9b-chat | crofai | 0.759 | 50.0% | 0.0% | 7602 ms | 60 (n=16) | 16 |
| 48 | gemma-4-31b-it | crofai | 0.751 | 62.5% | 6.2% | 25919 ms | 95 (n=16) | 16 |
| 49 | glm-4.7 | alibaba | 0.748 | 70.6% | 0.0% | 51912 ms | 73 (n=16) | 17 |
| 50 | glm-4.5-air | zai | 0.739 | 68.8% | 12.5% | 38974 ms | 69 (n=16) | 16 |
| 51 | deepseek-v4-pro | crofai | 0.738 | 72.2% | 2.8% | 42224 ms | 58 (n=30) | 36 |
| 52 | hf:zai-org/GLM-4.7-Flash | synthetic | 0.737 | 56.2% | 6.2% | 18242 ms | 168 (n=16) | 16 |
| 53 | z-ai/glm-4.5-air:free | openrouter | 0.720 | 68.8% | 6.2% | 15818 ms | 494 (n=15) | 16 |
| 54 | qwen3.6-27b | crofai | 0.711 | 62.5% | 0.0% | 25853 ms | 111 (n=16) | 16 |
| 55 | poolside/laguna-xs.2:free | openrouter | 0.705 | 50.0% | 6.2% | 462 ms | 1824 (n=16) | 16 |
| 56 | qwen3.6-plus | alibaba | 0.684 | 68.8% | 0.0% | 119876 ms | 54 (n=13) | 16 |
| 57 | minimax-m2.5 | crofai | 0.679 | 58.3% | 22.2% | 30032 ms | 82 (n=32) | 36 |
| 58 | minimax-m2.7 (MiniMax-M2.7, MiniMax-M2.7-highspeed) | minimax | 0.673 | 63.9% | 13.9% | 37635 ms | 59 (n=64) | 72 |
| 59 | qwen3.5-9b | crofai | 0.668 | 50.0% | 12.5% | 25253 ms | 96 (n=16) | 16 |
| 60 | poolside/laguna-m.1:free | openrouter | 0.666 | 56.2% | 12.5% | 936 ms | 1801 (n=16) | 16 |
| 61 | minimax-m2.1 (MiniMax-M2.1) | minimax | 0.658 | 58.3% | 19.4% | 13848 ms | 193 (n=32) | 36 |
| 62 | nvidia/nemotron-nano-9b-v2:free | openrouter | 0.648 | 56.2% | 12.5% | 484 ms | 5254 (n=16) | 16 |
| 63 | minimax-m2.5 (MiniMax-M2.5, MiniMax-M2.5-highspeed) | minimax | 0.648 | 59.7% | 19.4% | 25584 ms | 70 (n=64) | 72 |
| 64 | hf:Qwen/Qwen3.5-397B-A17B | synthetic | 0.642 | 50.0% | 6.2% | 28148 ms | 129 (n=16) | 16 |
| 65 | minimax-m2 (MiniMax-M2) | minimax | 0.637 | 55.6% | 19.4% | 19294 ms | 107 (n=32) | 36 |
| 66 | hf:zai-org/GLM-4.7 | synthetic | 0.630 | 62.5% | 6.2% | 23881 ms | 168 (n=16) | 16 |
| 67 | minimax-m2.5 (hf:MiniMaxAI/MiniMax-M2.5) | synthetic | 0.626 | 61.1% | 22.2% | 16525 ms | 127 (n=32) | 36 |
| 68 | deepseek-v3.2 | crofai | 0.623 | 52.8% | 19.4% | 52041 ms | 95 (n=30) | 36 |
| 69 | kimi-k2.5 (kimi-k2.5-lightning) | crofai | 0.603 | 52.8% | 9.7% | 35626 ms | 118 (n=60) | 72 |
| 70 | gpt-5.5-fast | openai | 0.545 | 50.0% | 0.0% | 11374 ms | 51 (n=15) | 16 |
| 71 | deepseek-r1 (hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528) | synthetic | 0.531 | 48.6% | 4.2% | 22601 ms | 112 (n=49) | 72 |
| 72 | gpt-5-mini | copilot | 0.518 | 50.0% | 0.0% | 12403 ms | 72 (n=16) | 16 |
| 73 | nvidia/nemotron-3-nano-30b-a3b:free | openrouter | 0.464 | 37.5% | 0.0% | 586 ms | 1758 (n=16) | 16 |
| 74 | deepseek-v3 (hf:deepseek-ai/DeepSeek-V3, hf:deepseek-ai/DeepSeek-V3-0324) | synthetic | 0.439 | 40.3% | 0.0% | 11168 ms | 38 (n=32) | 72 |
| 75 | llama3.1-8b | cerebras | 0.420 | 31.2% | 0.0% | 301 ms | 1094 (n=11) | 16 |
| 76 | kimi-k2.5 (hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4) | synthetic | 0.315 | 29.2% | 16.7% | 77458 ms | 34 (n=35) | 72 |
| 77 | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | openrouter | 0.263 | 12.5% | 50.0% | 729 ms | 1174 (n=16) | 16 |
| 78 | liquid/lfm-2.5-1.2b-thinking:free | openrouter | 0.256 | 12.5% | 18.8% | 498 ms | 6682 (n=16) | 16 |
| 79 | hf:moonshotai/Kimi-K2.6 | synthetic | 0.118 | 6.2% | 12.5% | 20255 ms | 162 (n=16) | 16 |
| 80 | qwen-3-235b-a22b-instruct-2507 | cerebras | 0.036 | 6.2% | 0.0% | 172 ms | 635 (n=1) | 16 |
| 81 | deepseek-v3.1 (hf:deepseek-ai/DeepSeek-V3.1, hf:deepseek-ai/DeepSeek-V3.1-Terminus) | synthetic | 0.028 | 0.0% | 0.0% | 337 ms | | 72 |
| 82 | kimi-k2 (hf:moonshotai/Kimi-K2-Instruct-0905) | synthetic | 0.028 | 0.0% | 0.0% | 337 ms | | 36 |
| 83 | kimi-k2-thinking (hf:moonshotai/Kimi-K2-Thinking) | synthetic | 0.028 | 0.0% | 0.0% | 326 ms | | 36 |
| 84 | minimax-m2 (hf:MiniMaxAI/MiniMax-M2) | synthetic | 0.028 | 0.0% | 0.0% | 330 ms | | 36 |
| 85 | minimax-m2.1 (hf:MiniMaxAI/MiniMax-M2.1) | synthetic | 0.028 | 0.0% | 0.0% | 326 ms | | 36 |
| 86 | gpt-5.4 | copilot | 0.028 | 0.0% | 0.0% | 527 ms | | 36 |
| 87 | gpt-5.4-mini | copilot | 0.028 | 0.0% | 0.0% | 351 ms | | 36 |
| 88 | claude-opus-4.5 | copilot | 0.028 | 0.0% | 0.0% | 350 ms | | 36 |
| 89 | qwen3-235b (hf:Qwen/Qwen3-235B-A22B-Instruct-2507) | synthetic | 0.028 | 0.0% | 0.0% | 335 ms | | 36 |
| 90 | llama-4-maverick (hf:meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) | synthetic | 0.028 | 0.0% | 0.0% | 329 ms | | 36 |
| 91 | gpt-oss-120b | cerebras | 0.000 | 0.0% | 0.0% | 127 ms | | 16 |
| 92 | zai-glm-4.7 | cerebras | 0.000 | 0.0% | 0.0% | 129 ms | | 16 |
| 93 | glm-4.7 | crofai | 0.000 | 0.0% | 0.0% | 848 ms | | 16 |
| 94 | glm-4.7-flash | crofai | 0.000 | 0.0% | 0.0% | 455 ms | | 16 |
| 95 | gpt-5-codex | openai | 0.000 | 0.0% | 0.0% | 192 ms | | 16 |
| 96 | gpt-5.1-codex | openai | 0.000 | 0.0% | 0.0% | 164 ms | | 16 |
| 97 | gpt-5.1-codex-max | openai | 0.000 | 0.0% | 0.0% | 171 ms | | 16 |
| 98 | gpt-5.1-codex-mini | openai | 0.000 | 0.0% | 0.0% | 288 ms | | 16 |
| 99 | glm-5v-turbo | zai | 0.000 | 0.0% | 0.0% | 689 ms | | 16 |
| 100 | glm-5 | crofai | 0.000 | 0.0% | 0.0% | 598 ms | | 16 |
| 101 | gpt-5.2-codex | openai | 0.000 | 0.0% | 0.0% | 181 ms | | 16 |
| 102 | gpt-5.3-codex-spark | openai | 0.000 | 0.0% | 0.0% | 186 ms | | 16 |
| 103 | hf:meta-llama/Llama-3.1-405B-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 349 ms | | 16 |
| 104 | hf:meta-llama/Llama-3.1-70B-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 331 ms | | 16 |
| 105 | greg | crofai | 0.000 | 0.0% | 0.0% | 602 ms | | 16 |
| 106 | hf:meta-llama/Llama-4-Scout-17B-16E-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 329 ms | | 16 |
| 107 | gpt-5.2-codex | copilot | 0.000 | 0.0% | 0.0% | 371 ms | | 16 |
| 108 | gpt-5.3-codex | copilot | 0.000 | 0.0% | 0.0% | 371 ms | | 16 |
| 109 | gpt-5.5-pro | openai | 0.000 | 0.0% | 0.0% | 166 ms | | 16 |
| 110 | qwen3.5-397b-a17b | crofai | 0.000 | 0.0% | 0.0% | 544 ms | | 16 |
| 111 | hf:Qwen/Qwen2.5-Coder-32B-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 321 ms | | 16 |
| 112 | hf:Qwen/Qwen3-235B-A22B-Thinking-2507 | synthetic | 0.000 | 0.0% | 0.0% | 788 ms | | 16 |
| 113 | hf:zai-org/GLM-4.6 | synthetic | 0.000 | 0.0% | 0.0% | 329 ms | | 16 |

Architect ranking

| # | Model | Provider | Score | Pass | Refusal | Latency | Real tok/s | Runs |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.2 | openai | 0.990 | 93.3% | 0.0% | 50349 ms | 54 (n=15) | 15 |
| 2 | openai/gpt-oss-120b:free | openrouter | 0.988 | 93.3% | 0.0% | 1355 ms | 3293 (n=15) | 15 |
| 3 | gpt-5.5-fast | openai | 0.985 | 86.7% | 0.0% | 80291 ms | 61 (n=15) | 15 |
| 4 | gpt-5-mini | copilot | 0.985 | 86.7% | 0.0% | 53603 ms | 89 (n=15) | 15 |
| 5 | claude-sonnet-4.5 | copilot | 0.982 | 93.3% | 0.0% | 86891 ms | 64 (n=15) | 15 |
| 6 | glm-5-turbo | zai | 0.976 | 80.0% | 0.0% | 150913 ms | 28 (n=15) | 15 |
| 7 | nvidia/nemotron-3-nano-30b-a3b:free | openrouter | 0.971 | 86.7% | 0.0% | 667 ms | 6775 (n=15) | 15 |
| 8 | gpt-5.3-codex | openai | 0.969 | 80.0% | 0.0% | 33636 ms | 53 (n=15) | 15 |
| 9 | gpt-5.2 | copilot | 0.969 | 86.7% | 0.0% | 46512 ms | 59 (n=15) | 15 |
| 10 | claude-opus-4.7 (claude-opus-4-7) | claude | 0.967 | 80.0% | 0.0% | 68064 ms | 57 (n=30) | 30 |
| 11 | claude-sonnet-4.5 (claude-sonnet-4-5) | claude | 0.965 | 80.0% | 0.0% | 105146 ms | 52 (n=15) | 15 |
| 12 | hf:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | synthetic | 0.964 | 86.7% | 0.0% | 37711 ms | 133 (n=15) | 15 |
| 13 | qwen3.5-plus | alibaba | 0.962 | 66.7% | 0.0% | 95567 ms | 54 (n=15) | 15 |
| 14 | MiniMax-M2.5 | alibaba | 0.962 | 80.0% | 0.0% | 72432 ms | 62 (n=15) | 15 |
| 15 | hf:Qwen/Qwen3.5-397B-A17B | synthetic | 0.961 | 80.0% | 0.0% | 56486 ms | 84 (n=15) | 15 |
| 16 | hf:zai-org/GLM-5 | synthetic | 0.957 | 80.0% | 0.0% | 55637 ms | 95 (n=15) | 15 |
| 17 | qwen3.6-plus | alibaba | 0.955 | 80.0% | 0.0% | 107257 ms | 54 (n=15) | 15 |
| 18 | glm-4.7 | alibaba | 0.954 | 86.7% | 0.0% | 82250 ms | 66 (n=30) | 30 |
| 19 | qwen3-max-2026-01-23 | alibaba | 0.952 | 73.3% | 0.0% | 64856 ms | 32 (n=15) | 15 |
| 20 | gpt-5.4-fast | openai | 0.946 | 80.0% | 0.0% | 82671 ms | 48 (n=14) | 15 |
| 21 | gpt-5.5 | openai | 0.943 | 86.7% | 0.0% | 78771 ms | 59 (n=14) | 15 |
| 22 | gpt-5.4-mini | openai | 0.941 | 80.0% | 0.0% | 18936 ms | 144 (n=15) | 15 |
| 23 | claude-sonnet-4.6 | copilot | 0.941 | 93.3% | 0.0% | 82968 ms | 57 (n=30) | 30 |
| 24 | glm-5 | alibaba | 0.940 | 80.0% | 0.0% | 128330 ms | 37 (n=15) | 15 |
| 25 | hf:zai-org/GLM-4.7 | synthetic | 0.940 | 80.0% | 0.0% | 35800 ms | 156 (n=15) | 15 |
| 26 | gemini-3.1-pro-preview | copilot | 0.940 | 73.3% | 0.0% | 26317 ms | 54 (n=15) | 15 |
| 27 | glm-5.1 (hf:zai-org/GLM-5.1) | synthetic | 0.939 | 76.7% | 0.0% | 43267 ms | 117 (n=30) | 30 |
| 28 | gpt-oss-120b (hf:openai/gpt-oss-120b) | synthetic | 0.928 | 83.3% | 0.0% | 37666 ms | 124 (n=30) | 30 |
| 29 | poolside/laguna-m.1:free | openrouter | 0.927 | 73.3% | 0.0% | 936 ms | 3439 (n=15) | 15 |
| 30 | gpt-5.4 | openai | 0.927 | 80.0% | 0.0% | 69424 ms | 51 (n=15) | 15 |
| 31 | glm-5.1 | zai | 0.921 | 80.0% | 0.0% | 92423 ms | 44 (n=30) | 30 |
| 32 | gemini-2.5-pro | copilot | 0.920 | 66.7% | 0.0% | 35868 ms | 82 (n=30) | 30 |
| 33 | gpt-4.1 | copilot | 0.919 | 80.0% | 0.0% | 18585 ms | 91 (n=15) | 15 |
| 34 | deepseek-v3.2 (deepseek-chat) | deepseek | 0.917 | 80.0% | 0.0% | 299 ms | 9106 (n=30) | 30 |
| 35 | grok-code-fast-1 | copilot | 0.915 | 86.7% | 0.0% | 46233 ms | 42 (n=15) | 15 |
| 36 | minimax-m2.5 (hf:MiniMaxAI/MiniMax-M2.5) | synthetic | 0.911 | 83.3% | 0.0% | 27907 ms | 125 (n=30) | 30 |
| 37 | deepseek-r1 (deepseek-reasoner) | deepseek | 0.906 | 73.3% | 0.0% | 302 ms | 12802 (n=30) | 30 |
| 38 | gpt-5.4-mini-fast | openai | 0.905 | 80.0% | 0.0% | 19234 ms | 141 (n=15) | 15 |
| 39 | kimi-k2.5 | alibaba | 0.898 | 83.3% | 3.3% | 93816 ms | 38 (n=30) | 30 |
| 40 | qwen3.6-27b | crofai | 0.895 | 73.3% | 6.7% | 38313 ms | 139 (n=15) | 15 |
| 41 | gemini-3-flash (gemini-3-flash-preview) | copilot | 0.895 | 80.0% | 0.0% | 12003 ms | 143 (n=30) | 30 |
| 42 | minimax-m2.1 (MiniMax-M2.1) | minimax | 0.894 | 83.3% | 0.0% | 37422 ms | 177 (n=30) | 30 |
| 43 | claude-haiku-4.5 (claude-haiku-4-5-20251001) | claude | 0.893 | 77.4% | 0.0% | 41600 ms | 120 (n=31) | 31 |
| 44 | claude-sonnet-4.6 (claude-sonnet-4-6) | claude | 0.891 | 80.0% | 0.0% | 90807 ms | 58 (n=30) | 30 |
| 45 | claude-haiku-4.5 | copilot | 0.887 | 73.3% | 0.0% | 42931 ms | 119 (n=30) | 30 |
| 46 | qwen3-coder-next | alibaba | 0.887 | 80.0% | 0.0% | 23946 ms | 137 (n=15) | 15 |
| 47 | minimax-m2 (MiniMax-M2) | minimax | 0.886 | 76.7% | 0.0% | 48145 ms | 80 (n=30) | 30 |
| 48 | poolside/laguna-xs.2:free | openrouter | 0.883 | 80.0% | 0.0% | 519 ms | 4916 (n=15) | 15 |
| 49 | gemma-4-31b-it | crofai | 0.876 | 80.0% | 0.0% | 37773 ms | 106 (n=15) | 15 |
| 50 | qwen3-coder-plus | alibaba | 0.872 | 73.3% | 0.0% | 27180 ms | 49 (n=15) | 15 |
| 51 | minimax-m2.5 (MiniMax-M2.5, MiniMax-M2.5-highspeed) | minimax | 0.855 | 71.7% | 1.7% | 59382 ms | 62 (n=60) | 60 |
| 52 | deepseek-v4-pro | crofai | 0.854 | 66.7% | 3.3% | 58256 ms | 65 (n=27) | 30 |
| 53 | glm-5.1 (glm-5.1-precision) | crofai | 0.848 | 70.0% | 0.0% | 65061 ms | 62 (n=53) | 60 |
| 54 | deepseek-v3.2 (hf:deepseek-ai/DeepSeek-V3.2) | synthetic | 0.847 | 73.3% | 0.0% | 52751 ms | 73 (n=30) | 30 |
| 55 | hf:zai-org/GLM-4.7-Flash | synthetic | 0.837 | 66.7% | 0.0% | 29875 ms | 171 (n=15) | 15 |
| 56 | glm-4.5-air | zai | 0.825 | 60.0% | 6.7% | 85756 ms | 62 (n=15) | 15 |
| 57 | claude-opus-4.5 (claude-opus-4-5) | claude | 0.821 | 73.3% | 0.0% | 77058 ms | 71 (n=30) | 30 |
| 58 | kimi-k2.5 (kimi-k2.5-lightning) | crofai | 0.821 | 65.0% | 0.0% | 32411 ms | 128 (n=57) | 60 |
| 59 | deepseek-v4-flash | deepseek | 0.820 | 70.0% | 3.3% | 297 ms | 12566 (n=30) | 30 |
| 60 | qwen3-coder-480b (hf:Qwen/Qwen3-Coder-480B-A35B-Instruct) | synthetic | 0.819 | 70.0% | 0.0% | 38899 ms | 46 (n=30) | 30 |
| 61 | minimax-m2.7 (MiniMax-M2.7, MiniMax-M2.7-highspeed) | minimax | 0.807 | 63.3% | 5.0% | 106174 ms | 49 (n=57) | 60 |
| 62 | qwen3.5-9b-chat | crofai | 0.803 | 73.3% | 0.0% | 50475 ms | 66 (n=13) | 15 |
| 63 | kimi-k2.6 (kimi-k2.6-precision) | crofai | 0.791 | 65.0% | 0.0% | 57561 ms | 77 (n=49) | 60 |
| 64 | gpt-4o | copilot | 0.764 | 60.0% | 0.0% | 14253 ms | 94 (n=15) | 15 |
| 65 | deepseek-r1 (hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528) | synthetic | 0.753 | 68.3% | 0.0% | 33760 ms | 118 (n=48) | 60 |
| 66 | z-ai/glm-4.5-air:free | openrouter | 0.751 | 60.0% | 0.0% | 5389 ms | 944 (n=14) | 15 |
| 67 | nvidia/nemotron-nano-9b-v2:free | openrouter | 0.747 | 60.0% | 0.0% | 575 ms | 4029 (n=15) | 15 |
| 68 | deepseek-v4-pro | deepseek | 0.707 | 53.3% | 20.0% | 291 ms | 15079 (n=30) | 30 |
| 69 | glm-4.7 | zai | 0.699 | 60.0% | 0.0% | 188863 ms | 26 (n=11) | 15 |
| 70 | minimax-m2.5 | crofai | 0.694 | 56.7% | 3.3% | 51840 ms | 91 (n=25) | 30 |
| 71 | hf:moonshotai/Kimi-K2.6 | synthetic | 0.672 | 60.0% | 20.0% | 55146 ms | 143 (n=15) | 15 |
| 72 | deepseek-v3.2 | crofai | 0.635 | 50.0% | 3.3% | 63967 ms | 63 (n=22) | 30 |
| 73 | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | openrouter | 0.604 | 40.0% | 26.7% | 10593 ms | 6074 (n=14) | 15 |
| 74 | hf:meta-llama/Llama-3.3-70B-Instruct | synthetic | 0.595 | 20.0% | 0.0% | 21937 ms | 63 (n=15) | 15 |
| 75 | qwen3.5-9b | crofai | 0.576 | 40.0% | 13.3% | 94939 ms | 87 (n=9) | 15 |
| 76 | llama3.1-8b | cerebras | 0.546 | 0.0% | 0.0% | 605 ms | 1695 (n=15) | 15 |
| 77 | deepseek-v3 (hf:deepseek-ai/DeepSeek-V3, hf:deepseek-ai/DeepSeek-V3-0324) | synthetic | 0.494 | 38.3% | 0.0% | 22097 ms | 33 (n=30) | 60 |
| 78 | kimi-k2.5 (hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4) | synthetic | 0.457 | 28.3% | 5.0% | 126275 ms | 53 (n=29) | 60 |
| 79 | qwen-3-235b-a22b-instruct-2507 | cerebras | 0.403 | 33.3% | 0.0% | 1220 ms | 726 (n=5) | 15 |
| 80 | liquid/lfm-2.5-1.2b-thinking:free | openrouter | 0.391 | 0.0% | 6.7% | 749 ms | 3584 (n=15) | 15 |
| 81 | gpt-5.2-codex | copilot | 0.341 | 26.7% | 0.0% | 404 ms | | 15 |
| 82 | hf:meta-llama/Llama-3.1-70B-Instruct | synthetic | 0.266 | 6.7% | 0.0% | 347 ms | | 15 |
| 83 | minimax-m2 (hf:MiniMaxAI/MiniMax-M2) | synthetic | 0.254 | 6.7% | 0.0% | 350 ms | | 30 |
| 84 | deepseek-v3.1 (hf:deepseek-ai/DeepSeek-V3.1, hf:deepseek-ai/DeepSeek-V3.1-Terminus) | synthetic | 0.148 | 0.0% | 0.0% | 487 ms | | 60 |
| 85 | kimi-k2 (hf:moonshotai/Kimi-K2-Instruct-0905) | synthetic | 0.148 | 0.0% | 0.0% | 344 ms | | 30 |
| 86 | kimi-k2-thinking (hf:moonshotai/Kimi-K2-Thinking) | synthetic | 0.148 | 0.0% | 0.0% | 333 ms | | 30 |
| 87 | minimax-m2.1 (hf:MiniMaxAI/MiniMax-M2.1) | synthetic | 0.148 | 0.0% | 0.0% | 344 ms | | 30 |
| 88 | gpt-5.4 | copilot | 0.148 | 0.0% | 0.0% | 512 ms | | 30 |
| 89 | gpt-5.4-mini | copilot | 0.148 | 0.0% | 0.0% | 358 ms | | 30 |
| 90 | claude-opus-4.5 | copilot | 0.148 | 0.0% | 0.0% | 360 ms | | 30 |
| 91 | qwen3-235b (hf:Qwen/Qwen3-235B-A22B-Instruct-2507) | synthetic | 0.148 | 0.0% | 0.0% | 338 ms | | 30 |
| 92 | llama-4-maverick (hf:meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) | synthetic | 0.148 | 0.0% | 0.0% | 340 ms | | 30 |
| 93 | gpt-oss-120b | cerebras | 0.148 | 0.0% | 0.0% | 144 ms | | 15 |
| 94 | zai-glm-4.7 | cerebras | 0.148 | 0.0% | 0.0% | 129 ms | | 15 |
| 95 | glm-4.7 | crofai | 0.148 | 0.0% | 0.0% | 910 ms | | 15 |
| 96 | glm-4.7-flash | crofai | 0.148 | 0.0% | 0.0% | 469 ms | | 15 |
| 97 | gpt-5-codex | openai | 0.148 | 0.0% | 0.0% | 196 ms | | 15 |
| 98 | gpt-5.1-codex | openai | 0.148 | 0.0% | 0.0% | 193 ms | | 15 |
| 99 | gpt-5.1-codex-max | openai | 0.148 | 0.0% | 0.0% | 181 ms | | 15 |
| 100 | gpt-5.1-codex-mini | openai | 0.148 | 0.0% | 0.0% | 173 ms | | 15 |
| 101 | glm-5v-turbo | zai | 0.148 | 0.0% | 0.0% | 723 ms | | 15 |
| 102 | glm-5 | crofai | 0.148 | 0.0% | 0.0% | 604 ms | | 15 |
| 103 | gpt-5.2-codex | openai | 0.148 | 0.0% | 0.0% | 217 ms | | 15 |
| 104 | gpt-5.3-codex-spark | openai | 0.148 | 0.0% | 0.0% | 195 ms | | 15 |
| 105 | hf:meta-llama/Llama-3.1-405B-Instruct | synthetic | 0.148 | 0.0% | 0.0% | 350 ms | | 15 |
| 106 | greg | crofai | 0.148 | 0.0% | 0.0% | 608 ms | | 15 |
| 107 | hf:meta-llama/Llama-4-Scout-17B-16E-Instruct | synthetic | 0.148 | 0.0% | 0.0% | 332 ms | | 15 |
| 108 | gpt-5.3-codex | copilot | 0.148 | 0.0% | 0.0% | 381 ms | | 15 |
| 109 | gpt-5.5-pro | openai | 0.148 | 0.0% | 0.0% | 191 ms | | 15 |
| 110 | qwen3.5-397b-a17b | crofai | 0.148 | 0.0% | 0.0% | 512 ms | | 15 |
| 111 | hf:Qwen/Qwen2.5-Coder-32B-Instruct | synthetic | 0.148 | 0.0% | 0.0% | 341 ms | | 15 |
| 112 | hf:Qwen/Qwen3-235B-A22B-Thinking-2507 | synthetic | 0.148 | 0.0% | 0.0% | 867 ms | | 15 |
| 113 | hf:zai-org/GLM-4.6 | synthetic | 0.148 | 0.0% | 0.0% | 345 ms | | 15 |

Developer — score by category

Architect — score by category

Cross-provider drift

| Model | Role | Baseline | Provider | Acc base | Acc prov | Δ acc | z | Exact | Semantic sim |
|---|---|---|---|---|---|---|---|---|---|
| glm-5.1 | architect | zai | synthetic (hf:zai-org/GLM-5.1) | 80.0% | 76.7% | -3.3% | -0.31 | 0.0% | |
| glm-5.1 | architect | zai | crofai (glm-5.1-precision) | 80.0% | 70.0% | -10.0% | -1.01 | 0.0% | |
| glm-5.1 | developer | zai | synthetic (hf:zai-org/GLM-5.1) | 72.2% | 83.3% | +11.1% | +1.13 | 5.6% | |
| glm-5.1 | developer | zai | crofai (glm-5.1-precision) | 72.2% | 79.2% | +6.9% | +0.81 | 1.4% | |
| deepseek-v4-pro | architect | deepseek | crofai | 53.3% | 66.7% | +13.3% | +1.05 | 0.0% | |
| deepseek-v4-pro | developer | deepseek | crofai | 75.0% | 72.2% | -2.8% | -0.27 | 5.6% | |
| minimax-m2 | architect | minimax (MiniMax-M2) | synthetic (hf:MiniMaxAI/MiniMax-M2) | 76.7% | 6.7% | -70.0% | -5.50 | 0.0% | |
| deepseek-v3.2 | architect | synthetic (hf:deepseek-ai/DeepSeek-V3.2) | deepseek (deepseek-chat) | 73.3% | 80.0% | +6.7% | +0.61 | 0.0% | |
| deepseek-v3.2 | architect | synthetic (hf:deepseek-ai/DeepSeek-V3.2) | crofai | 73.3% | 50.0% | -23.3% | -1.86 | 0.0% | |
| kimi-k2.5 | architect | synthetic (hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4) | crofai (kimi-k2.5-lightning) | 28.3% | 65.0% | +36.7% | +4.03 | 1.7% | |
| kimi-k2.5 | architect | synthetic (hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4) | alibaba | 28.3% | 83.3% | +55.0% | +4.93 | 3.3% | |
| deepseek-v3.2 | developer | synthetic (hf:deepseek-ai/DeepSeek-V3.2) | deepseek (deepseek-chat) | 75.0% | 75.0% | +0.0% | +0.00 | 2.8% | |
| deepseek-v3.2 | developer | synthetic (hf:deepseek-ai/DeepSeek-V3.2) | crofai | 75.0% | 52.8% | -22.2% | -1.96 | 5.6% | |
| minimax-m2 | developer | minimax (MiniMax-M2) | synthetic (hf:MiniMaxAI/MiniMax-M2) | 55.6% | 0.0% | -55.6% | -5.26 | 0.0% | |
| deepseek-r1 | architect | synthetic (hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528) | deepseek (deepseek-reasoner) | 68.3% | 73.3% | +5.0% | +0.49 | 0.0% | |
| kimi-k2.5 | developer | synthetic (hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4) | crofai (kimi-k2.5-lightning) | 29.2% | 52.8% | +23.6% | +2.88 | 26.4% | |
| kimi-k2.5 | developer | synthetic (hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4) | alibaba | 29.2% | 83.3% | +54.2% | +5.32 | 5.6% | |
| minimax-m2.1 | architect | minimax (MiniMax-M2.1) | synthetic (hf:MiniMaxAI/MiniMax-M2.1) | 83.3% | 0.0% | -83.3% | -6.55 | 0.0% | |
| deepseek-r1 | developer | synthetic (hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528) | deepseek (deepseek-reasoner) | 48.6% | 80.6% | +31.9% | +3.18 | 0.0% | |
| minimax-m2.1 | developer | minimax (MiniMax-M2.1) | synthetic (hf:MiniMaxAI/MiniMax-M2.1) | 58.3% | 0.0% | -58.3% | -5.44 | 5.6% | |
| minimax-m2.5 | architect | minimax (MiniMax-M2.5, MiniMax-M2.5-highspeed) | crofai | 71.7% | 56.7% | -15.0% | -1.42 | 0.0% | |
| minimax-m2.5 | architect | minimax (MiniMax-M2.5, MiniMax-M2.5-highspeed) | synthetic (hf:MiniMaxAI/MiniMax-M2.5) | 71.7% | 83.3% | +11.7% | +1.21 | 0.0% | |
| minimax-m2.5 | developer | minimax (MiniMax-M2.5, MiniMax-M2.5-highspeed) | crofai | 59.7% | 58.3% | -1.4% | -0.14 | 13.9% | |
| minimax-m2.5 | developer | minimax (MiniMax-M2.5, MiniMax-M2.5-highspeed) | synthetic (hf:MiniMaxAI/MiniMax-M2.5) | 59.7% | 61.1% | +1.4% | +0.14 | 11.1% | |
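
This page does not define how the z column is computed. The reported values are consistent with a pooled two-proportion z-test on pass rates, with the Runs counts from the ranking tables as sample sizes. The sketch below assumes that reading and checks it against two architect rows. The function name `two_proportion_z` is hypothetical, and the pass counts are back-calculated from the published percentages (for example, 80.0% of 30 runs is 24 passes).

```python
from math import sqrt


def two_proportion_z(pass_base: int, n_base: int, pass_prov: int, n_prov: int) -> float:
    """Pooled two-proportion z statistic for provider vs. baseline pass rate.

    Assumption: this is how the dashboard derives z. It is inferred from the
    published numbers and is not documented on the page.
    """
    p_base = pass_base / n_base
    p_prov = pass_prov / n_prov
    # Pooled pass rate under the null hypothesis of no provider difference.
    pooled = (pass_base + pass_prov) / (n_base + n_prov)
    se = sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_prov))
    if se == 0:
        return 0.0  # both providers at 0% or 100%: no measurable drift
    return (p_prov - p_base) / se


# glm-5.1, architect: zai 24/30 (80.0%) vs crofai 42/60 (70.0%)
z_glm = two_proportion_z(24, 30, 42, 60)
# minimax-m2, architect: minimax 23/30 (76.7%) vs synthetic 2/30 (6.7%)
z_minimax = two_proportion_z(23, 30, 2, 30)

print(f"{z_glm:.2f} {z_minimax:.2f}")  # -1.01 -5.50
```

Both results agree with the table to two decimals. With only 15 to 72 runs per cell, a |z| below about 2 corresponds to a difference of just a few prompts, so the warning-level rows are worth reading with that in mind.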

Score over time

Methodology & caveats

Each model is evaluated against a fixed suite of 35 prompts (20 developer, 15 architect) covering code generation from spec, bug fixing, refactoring, multi-file edits, tool use, architecture decision records, design, trade-off analysis and problem decomposition. Code prompts are scored by running a hidden pytest suite inside a Docker sandbox. Text prompts are scored by a combination of regex/contains verifiers and an LLM-as-judge step (Claude Opus 4.7).
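
To make the "regex/contains verifiers" step concrete, here is a minimal sketch of how such a check could be expressed. The page does not show the real verifier format, so `TextCheck`, `verify_text`, and the sample architecture-decision-record check are all hypothetical. The scoring rule (the fraction of checks satisfied) is also an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextCheck:
    """Hypothetical verifier spec: required substrings plus required regex patterns."""
    must_contain: list[str] = field(default_factory=list)
    must_match: list[str] = field(default_factory=list)


def verify_text(response: str, check: TextCheck) -> float:
    """Return the fraction of checks the response satisfies (0.0 if there are none)."""
    lowered = response.lower()
    # "contains" checks: case-insensitive substring tests
    results = [needle.lower() in lowered for needle in check.must_contain]
    # regex checks: searched anywhere in the text, ^ and $ anchored per line
    results += [
        re.search(pattern, response, re.IGNORECASE | re.MULTILINE) is not None
        for pattern in check.must_match
    ]
    return sum(results) / len(results) if results else 0.0


# Hypothetical check for an architecture decision record prompt.
adr_check = TextCheck(
    must_contain=["Decision", "Consequences"],
    must_match=[r"^#+\s*Status"],
)
sample = "# Status\nAccepted\n\n## Decision\nUse PostgreSQL for the order store."
score = verify_text(sample, adr_check)
print(round(score, 3))  # 0.667: "Consequences" is missing, the other two checks pass
```

Checks like these are cheap and deterministic, which presumably is why the suite pairs them with the LLM judge instead of relying on the judge alone.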

Known judge bias. The LLM judge (Claude Opus 4.7) may favour responses that resemble its own style: more verbose, more structured, or following Anthropic's coding conventions. Text scores are therefore indicative only; the pytest sandbox results are the most reliable signal. Treat the rankings as relative guidance, not ground truth.