Quality benchmarking beta
Alerts
| Severity | Kind | Model | Provider | Message |
|---|---|---|---|---|
| warning | drift | glm-5.1 | crofai | glm-5.1 on crofai is 10.0% behind zai (z=-1.01) |
| critical | drift | minimax-m2 | synthetic | minimax-m2 on synthetic is 70.0% behind minimax (z=-5.50) |
| warning | drift | deepseek-v3.2 | crofai | deepseek-v3.2 on crofai is 23.3% behind synthetic (z=-1.86) |
| warning | drift | deepseek-v3.2 | crofai | deepseek-v3.2 on crofai is 22.2% behind synthetic (z=-1.96) |
| warning | refusal_spike | deepseek-v3.2 | crofai | refusal rate +16.7% vs synthetic |
| critical | drift | minimax-m2 | synthetic | minimax-m2 on synthetic is 55.6% behind minimax (z=-5.26) |
| critical | drift | minimax-m2.1 | synthetic | minimax-m2.1 on synthetic is 83.3% behind minimax (z=-6.55) |
| critical | drift | minimax-m2.1 | synthetic | minimax-m2.1 on synthetic is 58.3% behind minimax (z=-5.44) |
| warning | drift | minimax-m2.5 | crofai | minimax-m2.5 on crofai is 15.0% behind minimax (z=-1.42) |
Developer ranking
| # | Model | Provider | Score | Pass | Refusal | Latency | Real tok/s | Runs |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4o | copilot | 0.901 | 75.0% | 0.0% | 5971 ms | 79 n=16 | 16 |
| 2 | glm-5.1 hf:zai-org/GLM-5.1 | synthetic | 0.898 | 83.3% | 0.0% | 18047 ms | 127 n=32 | 36 |
| 3 | kimi-k2.5 | alibaba | 0.898 | 83.3% | 0.0% | 23477 ms | 34 n=32 | 36 |
| 4 | qwen3-coder-480b hf:Qwen/Qwen3-Coder-480B-A35B-Instruct | synthetic | 0.898 | 83.3% | 0.0% | 15965 ms | 45 n=32 | 36 |
| 5 | gpt-oss-120b hf:openai/gpt-oss-120b | synthetic | 0.898 | 83.3% | 0.0% | 6736 ms | 134 n=32 | 36 |
| 6 | deepseek-r1 deepseek-reasoner | deepseek | 0.886 | 80.6% | 0.0% | 2701 ms | 2268 n=32 | 36 |
| 7 | gpt-5.2 | openai | 0.884 | 81.2% | 0.0% | 7798 ms | 45 n=16 | 16 |
| 8 | glm-5 | alibaba | 0.884 | 81.2% | 0.0% | 77372 ms | 46 n=16 | 16 |
| 9 | gpt-5.3-codex | openai | 0.884 | 81.2% | 0.0% | 8194 ms | 46 n=16 | 16 |
| 10 | gemini-3.1-pro-preview | copilot | 0.884 | 81.2% | 0.0% | 12277 ms | 26 n=16 | 16 |
| 11 | qwen3-max-2026-01-23 | alibaba | 0.884 | 81.2% | 0.0% | 10918 ms | 26 n=16 | 16 |
| 12 | gpt-4.1 | copilot | 0.884 | 81.2% | 0.0% | 4708 ms | 69 n=16 | 16 |
| 13 | gpt-5.4-mini | openai | 0.884 | 81.2% | 0.0% | 3616 ms | 95 n=16 | 16 |
| 14 | gpt-5.2 | copilot | 0.884 | 81.2% | 0.0% | 3994 ms | 74 n=16 | 16 |
| 15 | gpt-5.4-mini-fast | openai | 0.884 | 81.2% | 0.0% | 3596 ms | 86 n=16 | 16 |
| 16 | claude-sonnet-4.5 claude-sonnet-4-5 | claude | 0.878 | 75.0% | 0.0% | 6871 ms | 67 n=16 | 16 |
| 17 | claude-sonnet-4.5 | copilot | 0.878 | 75.0% | 0.0% | 7456 ms | 66 n=16 | 16 |
| 18 | hf:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | synthetic | 0.876 | 75.0% | 0.0% | 6989 ms | 122 n=16 | 16 |
| 19 | hf:meta-llama/Llama-3.3-70B-Instruct | synthetic | 0.871 | 75.0% | 0.0% | 7006 ms | 77 n=16 | 16 |
| 20 | deepseek-v3.2 deepseek-chat | deepseek | 0.866 | 75.0% | 0.0% | 2122 ms | 1098 n=32 | 36 |
| 21 | glm-5.1 glm-5.1-precision | crofai | 0.863 | 79.2% | 0.0% | 34600 ms | 56 n=63 | 72 |
| 22 | gpt-5.4-fast | openai | 0.848 | 75.0% | 0.0% | 6089 ms | 40 n=16 | 16 |
| 23 | claude-sonnet-4.6 claude-sonnet-4-6 | claude | 0.838 | 75.0% | 0.0% | 8767 ms | 67 n=32 | 36 |
| 24 | qwen3.5-plus | alibaba | 0.830 | 75.0% | 0.0% | 82819 ms | 54 n=15 | 16 |
| 25 | grok-code-fast-1 | copilot | 0.830 | 75.0% | 0.0% | 29713 ms | 10 n=16 | 16 |
| 26 | openai/gpt-oss-120b:free | openrouter | 0.824 | 68.8% | 0.0% | 963 ms | 547 n=16 | 16 |
| 27 | gemini-2.5-pro | copilot | 0.822 | 75.0% | 0.0% | 10567 ms | 48 n=32 | 36 |
| 28 | glm-5.1 | zai | 0.819 | 72.2% | 2.8% | 37501 ms | 57 n=32 | 36 |
| 29 | deepseek-v4-pro | deepseek | 0.818 | 75.0% | 8.3% | 5806 ms | 3026 n=32 | 36 |
| 30 | deepseek-v4-flash | deepseek | 0.810 | 72.2% | 8.3% | 3766 ms | 3133 n=32 | 36 |
| 31 | gemini-3-flash gemini-3-flash-preview | copilot | 0.806 | 72.2% | 2.8% | 7486 ms | 44 n=32 | 36 |
| 32 | claude-sonnet-4.6 | copilot | 0.806 | 72.2% | 0.0% | 3608 ms | 63 n=32 | 36 |
| 33 | gpt-5.4 | openai | 0.800 | 75.0% | 0.0% | 8211 ms | 38 n=16 | 16 |
| 34 | claude-opus-4.7 claude-opus-4-7 | claude | 0.799 | 69.4% | 0.0% | 5421 ms | 80 n=32 | 36 |
| 35 | kimi-k2.6 kimi-k2.6-precision | crofai | 0.798 | 73.6% | 2.8% | 25284 ms | 57 n=62 | 72 |
| 36 | claude-haiku-4.5 | copilot | 0.790 | 69.4% | 0.0% | 4068 ms | 131 n=32 | 36 |
| 37 | glm-4.7 | zai | 0.775 | 75.0% | 6.2% | 75511 ms | 41 n=16 | 16 |
| 38 | glm-5-turbo | zai | 0.775 | 75.0% | 6.2% | 34618 ms | 63 n=16 | 16 |
| 39 | MiniMax-M2.5 | alibaba | 0.775 | 68.8% | 0.0% | 53205 ms | 59 n=16 | 16 |
| 40 | qwen3-coder-next | alibaba | 0.775 | 68.8% | 0.0% | 3448 ms | 127 n=16 | 16 |
| 41 | qwen3-coder-plus | alibaba | 0.775 | 68.8% | 0.0% | 4059 ms | 60 n=16 | 16 |
| 42 | hf:zai-org/GLM-5 | synthetic | 0.775 | 75.0% | 6.2% | 33968 ms | 118 n=16 | 16 |
| 43 | claude-haiku-4.5 claude-haiku-4-5-20251001 | claude | 0.774 | 66.7% | 0.0% | 3539 ms | 146 n=32 | 36 |
| 44 | claude-opus-4.5 claude-opus-4-5 | claude | 0.774 | 66.7% | 0.0% | 5790 ms | 72 n=32 | 36 |
| 45 | gpt-5.5 | openai | 0.764 | 68.8% | 0.0% | 17303 ms | 56 n=16 | 16 |
| 46 | deepseek-v3.2 hf:deepseek-ai/DeepSeek-V3.2 | synthetic | 0.762 | 75.0% | 2.8% | 26285 ms | 76 n=32 | 36 |
| 47 | qwen3.5-9b-chat | crofai | 0.759 | 50.0% | 0.0% | 7602 ms | 60 n=16 | 16 |
| 48 | gemma-4-31b-it | crofai | 0.751 | 62.5% | 6.2% | 25919 ms | 95 n=16 | 16 |
| 49 | glm-4.7 | alibaba | 0.748 | 70.6% | 0.0% | 51912 ms | 73 n=16 | 17 |
| 50 | glm-4.5-air | zai | 0.739 | 68.8% | 12.5% | 38974 ms | 69 n=16 | 16 |
| 51 | deepseek-v4-pro | crofai | 0.738 | 72.2% | 2.8% | 42224 ms | 58 n=30 | 36 |
| 52 | hf:zai-org/GLM-4.7-Flash | synthetic | 0.737 | 56.2% | 6.2% | 18242 ms | 168 n=16 | 16 |
| 53 | z-ai/glm-4.5-air:free | openrouter | 0.720 | 68.8% | 6.2% | 15818 ms | 494 n=15 | 16 |
| 54 | qwen3.6-27b | crofai | 0.711 | 62.5% | 0.0% | 25853 ms | 111 n=16 | 16 |
| 55 | poolside/laguna-xs.2:free | openrouter | 0.705 | 50.0% | 6.2% | 462 ms | 1824 n=16 | 16 |
| 56 | qwen3.6-plus | alibaba | 0.684 | 68.8% | 0.0% | 119876 ms | 54 n=13 | 16 |
| 57 | minimax-m2.5 | crofai | 0.679 | 58.3% | 22.2% | 30032 ms | 82 n=32 | 36 |
| 58 | minimax-m2.7 MiniMax-M2.7, MiniMax-M2.7-highspeed | minimax | 0.673 | 63.9% | 13.9% | 37635 ms | 59 n=64 | 72 |
| 59 | qwen3.5-9b | crofai | 0.668 | 50.0% | 12.5% | 25253 ms | 96 n=16 | 16 |
| 60 | poolside/laguna-m.1:free | openrouter | 0.666 | 56.2% | 12.5% | 936 ms | 1801 n=16 | 16 |
| 61 | minimax-m2.1 MiniMax-M2.1 | minimax | 0.658 | 58.3% | 19.4% | 13848 ms | 193 n=32 | 36 |
| 62 | nvidia/nemotron-nano-9b-v2:free | openrouter | 0.648 | 56.2% | 12.5% | 484 ms | 5254 n=16 | 16 |
| 63 | minimax-m2.5 MiniMax-M2.5, MiniMax-M2.5-highspeed | minimax | 0.648 | 59.7% | 19.4% | 25584 ms | 70 n=64 | 72 |
| 64 | hf:Qwen/Qwen3.5-397B-A17B | synthetic | 0.642 | 50.0% | 6.2% | 28148 ms | 129 n=16 | 16 |
| 65 | minimax-m2 MiniMax-M2 | minimax | 0.637 | 55.6% | 19.4% | 19294 ms | 107 n=32 | 36 |
| 66 | hf:zai-org/GLM-4.7 | synthetic | 0.630 | 62.5% | 6.2% | 23881 ms | 168 n=16 | 16 |
| 67 | minimax-m2.5 hf:MiniMaxAI/MiniMax-M2.5 | synthetic | 0.626 | 61.1% | 22.2% | 16525 ms | 127 n=32 | 36 |
| 68 | deepseek-v3.2 | crofai | 0.623 | 52.8% | 19.4% | 52041 ms | 95 n=30 | 36 |
| 69 | kimi-k2.5 kimi-k2.5-lightning | crofai | 0.603 | 52.8% | 9.7% | 35626 ms | 118 n=60 | 72 |
| 70 | gpt-5.5-fast | openai | 0.545 | 50.0% | 0.0% | 11374 ms | 51 n=15 | 16 |
| 71 | deepseek-r1 hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528 | synthetic | 0.531 | 48.6% | 4.2% | 22601 ms | 112 n=49 | 72 |
| 72 | gpt-5-mini | copilot | 0.518 | 50.0% | 0.0% | 12403 ms | 72 n=16 | 16 |
| 73 | nvidia/nemotron-3-nano-30b-a3b:free | openrouter | 0.464 | 37.5% | 0.0% | 586 ms | 1758 n=16 | 16 |
| 74 | deepseek-v3 hf:deepseek-ai/DeepSeek-V3, hf:deepseek-ai/DeepSeek-V3-0324 | synthetic | 0.439 | 40.3% | 0.0% | 11168 ms | 38 n=32 | 72 |
| 75 | llama3.1-8b | cerebras | 0.420 | 31.2% | 0.0% | 301 ms | 1094 n=11 | 16 |
| 76 | kimi-k2.5 hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4 | synthetic | 0.315 | 29.2% | 16.7% | 77458 ms | 34 n=35 | 72 |
| 77 | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | openrouter | 0.263 | 12.5% | 50.0% | 729 ms | 1174 n=16 | 16 |
| 78 | liquid/lfm-2.5-1.2b-thinking:free | openrouter | 0.256 | 12.5% | 18.8% | 498 ms | 6682 n=16 | 16 |
| 79 | hf:moonshotai/Kimi-K2.6 | synthetic | 0.118 | 6.2% | 12.5% | 20255 ms | 162 n=16 | 16 |
| 80 | qwen-3-235b-a22b-instruct-2507 | cerebras | 0.036 | 6.2% | 0.0% | 172 ms | 635 n=1 | 16 |
| 81 | deepseek-v3.1 hf:deepseek-ai/DeepSeek-V3.1, hf:deepseek-ai/DeepSeek-V3.1-Terminus | synthetic | 0.028 | 0.0% | 0.0% | 337 ms | — | 72 |
| 82 | kimi-k2 hf:moonshotai/Kimi-K2-Instruct-0905 | synthetic | 0.028 | 0.0% | 0.0% | 337 ms | — | 36 |
| 83 | kimi-k2-thinking hf:moonshotai/Kimi-K2-Thinking | synthetic | 0.028 | 0.0% | 0.0% | 326 ms | — | 36 |
| 84 | minimax-m2 hf:MiniMaxAI/MiniMax-M2 | synthetic | 0.028 | 0.0% | 0.0% | 330 ms | — | 36 |
| 85 | minimax-m2.1 hf:MiniMaxAI/MiniMax-M2.1 | synthetic | 0.028 | 0.0% | 0.0% | 326 ms | — | 36 |
| 86 | gpt-5.4 | copilot | 0.028 | 0.0% | 0.0% | 527 ms | — | 36 |
| 87 | gpt-5.4-mini | copilot | 0.028 | 0.0% | 0.0% | 351 ms | — | 36 |
| 88 | claude-opus-4.5 | copilot | 0.028 | 0.0% | 0.0% | 350 ms | — | 36 |
| 89 | qwen3-235b hf:Qwen/Qwen3-235B-A22B-Instruct-2507 | synthetic | 0.028 | 0.0% | 0.0% | 335 ms | — | 36 |
| 90 | llama-4-maverick hf:meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | synthetic | 0.028 | 0.0% | 0.0% | 329 ms | — | 36 |
| 91 | gpt-oss-120b | cerebras | 0.000 | 0.0% | 0.0% | 127 ms | — | 16 |
| 92 | zai-glm-4.7 | cerebras | 0.000 | 0.0% | 0.0% | 129 ms | — | 16 |
| 93 | glm-4.7 | crofai | 0.000 | 0.0% | 0.0% | 848 ms | — | 16 |
| 94 | glm-4.7-flash | crofai | 0.000 | 0.0% | 0.0% | 455 ms | — | 16 |
| 95 | gpt-5-codex | openai | 0.000 | 0.0% | 0.0% | 192 ms | — | 16 |
| 96 | gpt-5.1-codex | openai | 0.000 | 0.0% | 0.0% | 164 ms | — | 16 |
| 97 | gpt-5.1-codex-max | openai | 0.000 | 0.0% | 0.0% | 171 ms | — | 16 |
| 98 | gpt-5.1-codex-mini | openai | 0.000 | 0.0% | 0.0% | 288 ms | — | 16 |
| 99 | glm-5v-turbo | zai | 0.000 | 0.0% | 0.0% | 689 ms | — | 16 |
| 100 | glm-5 | crofai | 0.000 | 0.0% | 0.0% | 598 ms | — | 16 |
| 101 | gpt-5.2-codex | openai | 0.000 | 0.0% | 0.0% | 181 ms | — | 16 |
| 102 | gpt-5.3-codex-spark | openai | 0.000 | 0.0% | 0.0% | 186 ms | — | 16 |
| 103 | hf:meta-llama/Llama-3.1-405B-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 349 ms | — | 16 |
| 104 | hf:meta-llama/Llama-3.1-70B-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 331 ms | — | 16 |
| 105 | greg | crofai | 0.000 | 0.0% | 0.0% | 602 ms | — | 16 |
| 106 | hf:meta-llama/Llama-4-Scout-17B-16E-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 329 ms | — | 16 |
| 107 | gpt-5.2-codex | copilot | 0.000 | 0.0% | 0.0% | 371 ms | — | 16 |
| 108 | gpt-5.3-codex | copilot | 0.000 | 0.0% | 0.0% | 371 ms | — | 16 |
| 109 | gpt-5.5-pro | openai | 0.000 | 0.0% | 0.0% | 166 ms | — | 16 |
| 110 | qwen3.5-397b-a17b | crofai | 0.000 | 0.0% | 0.0% | 544 ms | — | 16 |
| 111 | hf:Qwen/Qwen2.5-Coder-32B-Instruct | synthetic | 0.000 | 0.0% | 0.0% | 321 ms | — | 16 |
| 112 | hf:Qwen/Qwen3-235B-A22B-Thinking-2507 | synthetic | 0.000 | 0.0% | 0.0% | 788 ms | — | 16 |
| 113 | hf:zai-org/GLM-4.6 | synthetic | 0.000 | 0.0% | 0.0% | 329 ms | — | 16 |
Architect ranking
| # | Model | Provider | Score | Pass | Refusal | Latency | Real tok/s | Runs |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.2 | openai | 0.990 | 93.3% | 0.0% | 50349 ms | 54 n=15 | 15 |
| 2 | openai/gpt-oss-120b:free | openrouter | 0.988 | 93.3% | 0.0% | 1355 ms | 3293 n=15 | 15 |
| 3 | gpt-5.5-fast | openai | 0.985 | 86.7% | 0.0% | 80291 ms | 61 n=15 | 15 |
| 4 | gpt-5-mini | copilot | 0.985 | 86.7% | 0.0% | 53603 ms | 89 n=15 | 15 |
| 5 | claude-sonnet-4.5 | copilot | 0.982 | 93.3% | 0.0% | 86891 ms | 64 n=15 | 15 |
| 6 | glm-5-turbo | zai | 0.976 | 80.0% | 0.0% | 150913 ms | 28 n=15 | 15 |
| 7 | nvidia/nemotron-3-nano-30b-a3b:free | openrouter | 0.971 | 86.7% | 0.0% | 667 ms | 6775 n=15 | 15 |
| 8 | gpt-5.3-codex | openai | 0.969 | 80.0% | 0.0% | 33636 ms | 53 n=15 | 15 |
| 9 | gpt-5.2 | copilot | 0.969 | 86.7% | 0.0% | 46512 ms | 59 n=15 | 15 |
| 10 | claude-opus-4.7 claude-opus-4-7 | claude | 0.967 | 80.0% | 0.0% | 68064 ms | 57 n=30 | 30 |
| 11 | claude-sonnet-4.5 claude-sonnet-4-5 | claude | 0.965 | 80.0% | 0.0% | 105146 ms | 52 n=15 | 15 |
| 12 | hf:nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | synthetic | 0.964 | 86.7% | 0.0% | 37711 ms | 133 n=15 | 15 |
| 13 | qwen3.5-plus | alibaba | 0.962 | 66.7% | 0.0% | 95567 ms | 54 n=15 | 15 |
| 14 | MiniMax-M2.5 | alibaba | 0.962 | 80.0% | 0.0% | 72432 ms | 62 n=15 | 15 |
| 15 | hf:Qwen/Qwen3.5-397B-A17B | synthetic | 0.961 | 80.0% | 0.0% | 56486 ms | 84 n=15 | 15 |
| 16 | hf:zai-org/GLM-5 | synthetic | 0.957 | 80.0% | 0.0% | 55637 ms | 95 n=15 | 15 |
| 17 | qwen3.6-plus | alibaba | 0.955 | 80.0% | 0.0% | 107257 ms | 54 n=15 | 15 |
| 18 | glm-4.7 | alibaba | 0.954 | 86.7% | 0.0% | 82250 ms | 66 n=30 | 30 |
| 19 | qwen3-max-2026-01-23 | alibaba | 0.952 | 73.3% | 0.0% | 64856 ms | 32 n=15 | 15 |
| 20 | gpt-5.4-fast | openai | 0.946 | 80.0% | 0.0% | 82671 ms | 48 n=14 | 15 |
| 21 | gpt-5.5 | openai | 0.943 | 86.7% | 0.0% | 78771 ms | 59 n=14 | 15 |
| 22 | gpt-5.4-mini | openai | 0.941 | 80.0% | 0.0% | 18936 ms | 144 n=15 | 15 |
| 23 | claude-sonnet-4.6 | copilot | 0.941 | 93.3% | 0.0% | 82968 ms | 57 n=30 | 30 |
| 24 | glm-5 | alibaba | 0.940 | 80.0% | 0.0% | 128330 ms | 37 n=15 | 15 |
| 25 | hf:zai-org/GLM-4.7 | synthetic | 0.940 | 80.0% | 0.0% | 35800 ms | 156 n=15 | 15 |
| 26 | gemini-3.1-pro-preview | copilot | 0.940 | 73.3% | 0.0% | 26317 ms | 54 n=15 | 15 |
| 27 | glm-5.1 hf:zai-org/GLM-5.1 | synthetic | 0.939 | 76.7% | 0.0% | 43267 ms | 117 n=30 | 30 |
| 28 | gpt-oss-120b hf:openai/gpt-oss-120b | synthetic | 0.928 | 83.3% | 0.0% | 37666 ms | 124 n=30 | 30 |
| 29 | poolside/laguna-m.1:free | openrouter | 0.927 | 73.3% | 0.0% | 936 ms | 3439 n=15 | 15 |
| 30 | gpt-5.4 | openai | 0.927 | 80.0% | 0.0% | 69424 ms | 51 n=15 | 15 |
| 31 | glm-5.1 | zai | 0.921 | 80.0% | 0.0% | 92423 ms | 44 n=30 | 30 |
| 32 | gemini-2.5-pro | copilot | 0.920 | 66.7% | 0.0% | 35868 ms | 82 n=30 | 30 |
| 33 | gpt-4.1 | copilot | 0.919 | 80.0% | 0.0% | 18585 ms | 91 n=15 | 15 |
| 34 | deepseek-v3.2 deepseek-chat | deepseek | 0.917 | 80.0% | 0.0% | 299 ms | 9106 n=30 | 30 |
| 35 | grok-code-fast-1 | copilot | 0.915 | 86.7% | 0.0% | 46233 ms | 42 n=15 | 15 |
| 36 | minimax-m2.5 hf:MiniMaxAI/MiniMax-M2.5 | synthetic | 0.911 | 83.3% | 0.0% | 27907 ms | 125 n=30 | 30 |
| 37 | deepseek-r1 deepseek-reasoner | deepseek | 0.906 | 73.3% | 0.0% | 302 ms | 12802 n=30 | 30 |
| 38 | gpt-5.4-mini-fast | openai | 0.905 | 80.0% | 0.0% | 19234 ms | 141 n=15 | 15 |
| 39 | kimi-k2.5 | alibaba | 0.898 | 83.3% | 3.3% | 93816 ms | 38 n=30 | 30 |
| 40 | qwen3.6-27b | crofai | 0.895 | 73.3% | 6.7% | 38313 ms | 139 n=15 | 15 |
| 41 | gemini-3-flash gemini-3-flash-preview | copilot | 0.895 | 80.0% | 0.0% | 12003 ms | 143 n=30 | 30 |
| 42 | minimax-m2.1 MiniMax-M2.1 | minimax | 0.894 | 83.3% | 0.0% | 37422 ms | 177 n=30 | 30 |
| 43 | claude-haiku-4.5 claude-haiku-4-5-20251001 | claude | 0.893 | 77.4% | 0.0% | 41600 ms | 120 n=31 | 31 |
| 44 | claude-sonnet-4.6 claude-sonnet-4-6 | claude | 0.891 | 80.0% | 0.0% | 90807 ms | 58 n=30 | 30 |
| 45 | claude-haiku-4.5 | copilot | 0.887 | 73.3% | 0.0% | 42931 ms | 119 n=30 | 30 |
| 46 | qwen3-coder-next | alibaba | 0.887 | 80.0% | 0.0% | 23946 ms | 137 n=15 | 15 |
| 47 | minimax-m2 MiniMax-M2 | minimax | 0.886 | 76.7% | 0.0% | 48145 ms | 80 n=30 | 30 |
| 48 | poolside/laguna-xs.2:free | openrouter | 0.883 | 80.0% | 0.0% | 519 ms | 4916 n=15 | 15 |
| 49 | gemma-4-31b-it | crofai | 0.876 | 80.0% | 0.0% | 37773 ms | 106 n=15 | 15 |
| 50 | qwen3-coder-plus | alibaba | 0.872 | 73.3% | 0.0% | 27180 ms | 49 n=15 | 15 |
| 51 | minimax-m2.5 MiniMax-M2.5, MiniMax-M2.5-highspeed | minimax | 0.855 | 71.7% | 1.7% | 59382 ms | 62 n=60 | 60 |
| 52 | deepseek-v4-pro | crofai | 0.854 | 66.7% | 3.3% | 58256 ms | 65 n=27 | 30 |
| 53 | glm-5.1 glm-5.1-precision | crofai | 0.848 | 70.0% | 0.0% | 65061 ms | 62 n=53 | 60 |
| 54 | deepseek-v3.2 hf:deepseek-ai/DeepSeek-V3.2 | synthetic | 0.847 | 73.3% | 0.0% | 52751 ms | 73 n=30 | 30 |
| 55 | hf:zai-org/GLM-4.7-Flash | synthetic | 0.837 | 66.7% | 0.0% | 29875 ms | 171 n=15 | 15 |
| 56 | glm-4.5-air | zai | 0.825 | 60.0% | 6.7% | 85756 ms | 62 n=15 | 15 |
| 57 | claude-opus-4.5 claude-opus-4-5 | claude | 0.821 | 73.3% | 0.0% | 77058 ms | 71 n=30 | 30 |
| 58 | kimi-k2.5 kimi-k2.5-lightning | crofai | 0.821 | 65.0% | 0.0% | 32411 ms | 128 n=57 | 60 |
| 59 | deepseek-v4-flash | deepseek | 0.820 | 70.0% | 3.3% | 297 ms | 12566 n=30 | 30 |
| 60 | qwen3-coder-480b hf:Qwen/Qwen3-Coder-480B-A35B-Instruct | synthetic | 0.819 | 70.0% | 0.0% | 38899 ms | 46 n=30 | 30 |
| 61 | minimax-m2.7 MiniMax-M2.7, MiniMax-M2.7-highspeed | minimax | 0.807 | 63.3% | 5.0% | 106174 ms | 49 n=57 | 60 |
| 62 | qwen3.5-9b-chat | crofai | 0.803 | 73.3% | 0.0% | 50475 ms | 66 n=13 | 15 |
| 63 | kimi-k2.6 kimi-k2.6-precision | crofai | 0.791 | 65.0% | 0.0% | 57561 ms | 77 n=49 | 60 |
| 64 | gpt-4o | copilot | 0.764 | 60.0% | 0.0% | 14253 ms | 94 n=15 | 15 |
| 65 | deepseek-r1 hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528 | synthetic | 0.753 | 68.3% | 0.0% | 33760 ms | 118 n=48 | 60 |
| 66 | z-ai/glm-4.5-air:free | openrouter | 0.751 | 60.0% | 0.0% | 5389 ms | 944 n=14 | 15 |
| 67 | nvidia/nemotron-nano-9b-v2:free | openrouter | 0.747 | 60.0% | 0.0% | 575 ms | 4029 n=15 | 15 |
| 68 | deepseek-v4-pro | deepseek | 0.707 | 53.3% | 20.0% | 291 ms | 15079 n=30 | 30 |
| 69 | glm-4.7 | zai | 0.699 | 60.0% | 0.0% | 188863 ms | 26 n=11 | 15 |
| 70 | minimax-m2.5 | crofai | 0.694 | 56.7% | 3.3% | 51840 ms | 91 n=25 | 30 |
| 71 | hf:moonshotai/Kimi-K2.6 | synthetic | 0.672 | 60.0% | 20.0% | 55146 ms | 143 n=15 | 15 |
| 72 | deepseek-v3.2 | crofai | 0.635 | 50.0% | 3.3% | 63967 ms | 63 n=22 | 30 |
| 73 | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | openrouter | 0.604 | 40.0% | 26.7% | 10593 ms | 6074 n=14 | 15 |
| 74 | hf:meta-llama/Llama-3.3-70B-Instruct | synthetic | 0.595 | 20.0% | 0.0% | 21937 ms | 63 n=15 | 15 |
| 75 | qwen3.5-9b | crofai | 0.576 | 40.0% | 13.3% | 94939 ms | 87 n=9 | 15 |
| 76 | llama3.1-8b | cerebras | 0.546 | 0.0% | 0.0% | 605 ms | 1695 n=15 | 15 |
| 77 | deepseek-v3 hf:deepseek-ai/DeepSeek-V3, hf:deepseek-ai/DeepSeek-V3-0324 | synthetic | 0.494 | 38.3% | 0.0% | 22097 ms | 33 n=30 | 60 |
| 78 | kimi-k2.5 hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4 | synthetic | 0.457 | 28.3% | 5.0% | 126275 ms | 53 n=29 | 60 |
| 79 | qwen-3-235b-a22b-instruct-2507 | cerebras | 0.403 | 33.3% | 0.0% | 1220 ms | 726 n=5 | 15 |
| 80 | liquid/lfm-2.5-1.2b-thinking:free | openrouter | 0.391 | 0.0% | 6.7% | 749 ms | 3584 n=15 | 15 |
| 81 | gpt-5.2-codex | copilot | 0.341 | 26.7% | 0.0% | 404 ms | — | 15 |
| 82 | hf:meta-llama/Llama-3.1-70B-Instruct | synthetic | 0.266 | 6.7% | 0.0% | 347 ms | — | 15 |
| 83 | minimax-m2 hf:MiniMaxAI/MiniMax-M2 | synthetic | 0.254 | 6.7% | 0.0% | 350 ms | — | 30 |
| 84 | deepseek-v3.1 hf:deepseek-ai/DeepSeek-V3.1, hf:deepseek-ai/DeepSeek-V3.1-Terminus | synthetic | 0.148 | 0.0% | 0.0% | 487 ms | — | 60 |
| 85 | kimi-k2 hf:moonshotai/Kimi-K2-Instruct-0905 | synthetic | 0.148 | 0.0% | 0.0% | 344 ms | — | 30 |
| 86 | kimi-k2-thinking hf:moonshotai/Kimi-K2-Thinking | synthetic | 0.148 | 0.0% | 0.0% | 333 ms | — | 30 |
| 87 | minimax-m2.1 hf:MiniMaxAI/MiniMax-M2.1 | synthetic | 0.148 | 0.0% | 0.0% | 344 ms | — | 30 |
| 88 | gpt-5.4 | copilot | 0.148 | 0.0% | 0.0% | 512 ms | — | 30 |
| 89 | gpt-5.4-mini | copilot | 0.148 | 0.0% | 0.0% | 358 ms | — | 30 |
| 90 | claude-opus-4.5 | copilot | 0.148 | 0.0% | 0.0% | 360 ms | — | 30 |
| 91 | qwen3-235b hf:Qwen/Qwen3-235B-A22B-Instruct-2507 | synthetic | 0.148 | 0.0% | 0.0% | 338 ms | — | 30 |
| 92 | llama-4-maverick hf:meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | synthetic | 0.148 | 0.0% | 0.0% | 340 ms | — | 30 |
| 93 | gpt-oss-120b | cerebras | 0.148 | 0.0% | 0.0% | 144 ms | — | 15 |
| 94 | zai-glm-4.7 | cerebras | 0.148 | 0.0% | 0.0% | 129 ms | — | 15 |
| 95 | glm-4.7 | crofai | 0.148 | 0.0% | 0.0% | 910 ms | — | 15 |
| 96 | glm-4.7-flash | crofai | 0.148 | 0.0% | 0.0% | 469 ms | — | 15 |
| 97 | gpt-5-codex | openai | 0.148 | 0.0% | 0.0% | 196 ms | — | 15 |
| 98 | gpt-5.1-codex | openai | 0.148 | 0.0% | 0.0% | 193 ms | — | 15 |
| 99 | gpt-5.1-codex-max | openai | 0.148 | 0.0% | 0.0% | 181 ms | — | 15 |
| 100 | gpt-5.1-codex-mini | openai | 0.148 | 0.0% | 0.0% | 173 ms | — | 15 |
| 101 | glm-5v-turbo | zai | 0.148 | 0.0% | 0.0% | 723 ms | — | 15 |
| 102 | glm-5 | crofai | 0.148 | 0.0% | 0.0% | 604 ms | — | 15 |
| 103 | gpt-5.2-codex | openai | 0.148 | 0.0% | 0.0% | 217 ms | — | 15 |
| 104 | gpt-5.3-codex-spark | openai | 0.148 | 0.0% | 0.0% | 195 ms | — | 15 |
| 105 | hf:meta-llama/Llama-3.1-405B-Instruct | synthetic | 0.148 | 0.0% | 0.0% | 350 ms | — | 15 |
| 106 | greg | crofai | 0.148 | 0.0% | 0.0% | 608 ms | — | 15 |
| 107 | hf:meta-llama/Llama-4-Scout-17B-16E-Instruct | synthetic | 0.148 | 0.0% | 0.0% | 332 ms | — | 15 |
| 108 | gpt-5.3-codex | copilot | 0.148 | 0.0% | 0.0% | 381 ms | — | 15 |
| 109 | gpt-5.5-pro | openai | 0.148 | 0.0% | 0.0% | 191 ms | — | 15 |
| 110 | qwen3.5-397b-a17b | crofai | 0.148 | 0.0% | 0.0% | 512 ms | — | 15 |
| 111 | hf:Qwen/Qwen2.5-Coder-32B-Instruct | synthetic | 0.148 | 0.0% | 0.0% | 341 ms | — | 15 |
| 112 | hf:Qwen/Qwen3-235B-A22B-Thinking-2507 | synthetic | 0.148 | 0.0% | 0.0% | 867 ms | — | 15 |
| 113 | hf:zai-org/GLM-4.6 | synthetic | 0.148 | 0.0% | 0.0% | 345 ms | — | 15 |
Developer — score by category
Architect — score by category
Cross-provider drift
| Model | Role | Baseline | Provider | Acc base | Acc prov | Δ acc | z | Exact | Semantic sim |
|---|---|---|---|---|---|---|---|---|---|
| glm-5.1 | architect | zai | synthetic hf:zai-org/GLM-5.1 | 80.0% | 76.7% | -3.3% | -0.31 | 0.0% | — |
| glm-5.1 | architect | zai | crofai glm-5.1-precision | 80.0% | 70.0% | -10.0% | -1.01 | 0.0% | — |
| glm-5.1 | developer | zai | synthetic hf:zai-org/GLM-5.1 | 72.2% | 83.3% | +11.1% | +1.13 | 5.6% | — |
| glm-5.1 | developer | zai | crofai glm-5.1-precision | 72.2% | 79.2% | +6.9% | +0.81 | 1.4% | — |
| deepseek-v4-pro | architect | deepseek | crofai | 53.3% | 66.7% | +13.3% | +1.05 | 0.0% | — |
| deepseek-v4-pro | developer | deepseek | crofai | 75.0% | 72.2% | -2.8% | -0.27 | 5.6% | — |
| minimax-m2 | architect | minimax MiniMax-M2 | synthetic hf:MiniMaxAI/MiniMax-M2 | 76.7% | 6.7% | -70.0% | -5.50 | 0.0% | — |
| deepseek-v3.2 | architect | synthetic hf:deepseek-ai/DeepSeek-V3.2 | deepseek deepseek-chat | 73.3% | 80.0% | +6.7% | +0.61 | 0.0% | — |
| deepseek-v3.2 | architect | synthetic hf:deepseek-ai/DeepSeek-V3.2 | crofai | 73.3% | 50.0% | -23.3% | -1.86 | 0.0% | — |
| kimi-k2.5 | architect | synthetic hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4 | crofai kimi-k2.5-lightning | 28.3% | 65.0% | +36.7% | +4.03 | 1.7% | — |
| kimi-k2.5 | architect | synthetic hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4 | alibaba | 28.3% | 83.3% | +55.0% | +4.93 | 3.3% | — |
| deepseek-v3.2 | developer | synthetic hf:deepseek-ai/DeepSeek-V3.2 | deepseek deepseek-chat | 75.0% | 75.0% | +0.0% | +0.00 | 2.8% | — |
| deepseek-v3.2 | developer | synthetic hf:deepseek-ai/DeepSeek-V3.2 | crofai | 75.0% | 52.8% | -22.2% | -1.96 | 5.6% | — |
| minimax-m2 | developer | minimax MiniMax-M2 | synthetic hf:MiniMaxAI/MiniMax-M2 | 55.6% | 0.0% | -55.6% | -5.26 | 0.0% | — |
| deepseek-r1 | architect | synthetic hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528 | deepseek deepseek-reasoner | 68.3% | 73.3% | +5.0% | +0.49 | 0.0% | — |
| kimi-k2.5 | developer | synthetic hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4 | crofai kimi-k2.5-lightning | 29.2% | 52.8% | +23.6% | +2.88 | 26.4% | — |
| kimi-k2.5 | developer | synthetic hf:moonshotai/Kimi-K2.5, hf:nvidia/Kimi-K2.5-NVFP4 | alibaba | 29.2% | 83.3% | +54.2% | +5.32 | 5.6% | — |
| minimax-m2.1 | architect | minimax MiniMax-M2.1 | synthetic hf:MiniMaxAI/MiniMax-M2.1 | 83.3% | 0.0% | -83.3% | -6.55 | 0.0% | — |
| deepseek-r1 | developer | synthetic hf:deepseek-ai/DeepSeek-R1, hf:deepseek-ai/DeepSeek-R1-0528 | deepseek deepseek-reasoner | 48.6% | 80.6% | +31.9% | +3.18 | 0.0% | — |
| minimax-m2.1 | developer | minimax MiniMax-M2.1 | synthetic hf:MiniMaxAI/MiniMax-M2.1 | 58.3% | 0.0% | -58.3% | -5.44 | 5.6% | — |
| minimax-m2.5 | architect | minimax MiniMax-M2.5, MiniMax-M2.5-highspeed | crofai | 71.7% | 56.7% | -15.0% | -1.42 | 0.0% | — |
| minimax-m2.5 | architect | minimax MiniMax-M2.5, MiniMax-M2.5-highspeed | synthetic hf:MiniMaxAI/MiniMax-M2.5 | 71.7% | 83.3% | +11.7% | +1.21 | 0.0% | — |
| minimax-m2.5 | developer | minimax MiniMax-M2.5, MiniMax-M2.5-highspeed | crofai | 59.7% | 58.3% | -1.4% | -0.14 | 13.9% | — |
| minimax-m2.5 | developer | minimax MiniMax-M2.5, MiniMax-M2.5-highspeed | synthetic hf:MiniMaxAI/MiniMax-M2.5 | 59.7% | 61.1% | +1.4% | +0.14 | 11.1% | — |
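
The page does not say how the z column is computed. As a rough sketch, the values in the table look consistent with a pooled two-proportion z-statistic on pass rates, so the snippet below assumes that formula. The function name `two_proportion_z` and the pass counts are illustrative: the counts (24 of 30, 42 of 60, and so on) are reconstructed from the Acc columns and the Runs columns of the rankings, not taken from raw data.

```python
import math


def two_proportion_z(passes_base, runs_base, passes_prov, runs_prov):
    """Pooled two-proportion z-statistic (assumed formula, not confirmed by the source).

    Negative values mean the provider passes less often than the baseline.
    """
    p_base = passes_base / runs_base
    p_prov = passes_prov / runs_prov
    # Pool both samples to estimate the shared pass rate under "no drift".
    pooled = (passes_base + passes_prov) / (runs_base + runs_prov)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_base + 1 / runs_prov))
    if se == 0:
        # Both sides all-pass or all-fail: no measurable difference.
        return 0.0
    return (p_prov - p_base) / se


# glm-5.1 architect: zai baseline 80.0% of 30 runs vs crofai 70.0% of 60 runs.
print(f"{two_proportion_z(24, 30, 42, 60):+.2f}")  # -1.01

# minimax-m2 architect: minimax 76.7% of 30 runs vs synthetic 6.7% of 30 runs.
print(f"{two_proportion_z(23, 30, 2, 30):+.2f}")  # -5.50
```

Under this reading, rows with |z| near 1 to 2 rest on differences of only a few prompts, which is worth keeping in mind when interpreting the warning-level alerts.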
Score over time
Methodology & caveats
Each model is evaluated against a fixed suite of 35 prompts (20 developer, 15 architect) covering code generation from spec, bug fixing, refactoring, multi-file edits, tool use, architecture decision records, design, trade-off analysis and problem decomposition. Code prompts are scored by running a hidden pytest suite inside a Docker sandbox. Text prompts are scored by a combination of regex/contains verifiers and an LLM-as-judge step (Claude Opus 4.7).
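
To make the regex/contains step concrete, here is a minimal sketch of what such a verifier could look like. Everything in it is hypothetical: the function name `verify_text`, the `must_contain` and `must_match` parameters, and the scoring as a simple fraction of satisfied checks are assumptions for illustration. The source does not describe the actual checks or how they are combined with the LLM judge.

```python
import re


def verify_text(response, must_contain=(), must_match=()):
    """Return the fraction of checks a response satisfies (hypothetical scoring).

    must_contain: substrings that must appear, compared case-insensitively.
    must_match:   regex patterns that must match somewhere in the response.
    """
    lowered = response.lower()
    checks = [needle.lower() in lowered for needle in must_contain]
    checks += [
        re.search(pattern, response, re.IGNORECASE | re.MULTILINE) is not None
        for pattern in must_match
    ]
    # With no checks configured, treat the response as passing.
    return sum(checks) / len(checks) if checks else 1.0


adr = "## Decision\nWe adopt PostgreSQL.\n## Consequences\nOps must manage backups."
score = verify_text(
    adr,
    must_contain=["postgresql", "trade-off"],
    must_match=[r"^## Decision", r"^## Consequences"],
)
print(score)  # 0.75
```

In this example three of the four checks pass (the phrase "trade-off" is missing), so the verifier returns 0.75.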
⚠ Known judge bias. The LLM judge (Claude Opus) may favour responses that match its own style: more verbose, more structured, or following Anthropic's coding conventions. Text scores are indicative only; pytest sandbox results are the most reliable signal. Treat rankings as relative guidance, not ground truth.