About this leaderboard
How the numbers are produced and what they mean.
Methodology
Every benchmark session sends the same prompt to every model a provider exposes and records latency and output tokens per second. The site refreshes weekly.
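As a rough illustration of how latency could be captured for a single streamed response, the sketch below consumes a token stream and notes when the first and last tokens arrive. The names `time_stream` and `RunTiming` are hypothetical and not taken from the benchmark's code; the `clock` parameter exists only so the example can be checked deterministically.

```python
import time
from typing import Callable, Iterable, NamedTuple


class RunTiming(NamedTuple):
    ttft_s: float       # seconds from request start to first streamed token
    elapsed_s: float    # seconds from request start to end of stream
    output_tokens: int  # number of streamed tokens observed


def time_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> RunTiming:
    """Consume a token stream and record first-token and end-of-stream times.

    Assumes `tokens` is lazy (for example a generator wrapping a streaming
    HTTP response), so the request effectively starts when iteration begins.
    """
    start = clock()
    first_at = None
    count = 0
    for _ in tokens:
        if first_at is None:
            first_at = clock()  # first token arrived
        count += 1
    end = clock()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    return RunTiming(first_at - start, end - start, count)
```

With a real provider, `tokens` would be whatever iterator the streaming client yields; wall-clock timing of that kind includes network and queueing delay.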
Why we don't compare list prices
Most providers covered here sell flat-rate coding subscriptions (Claude Code, ChatGPT/Codex, GLM Coding Plan, MiniMax Token Plan, Alibaba Coding Plan). Comparing per-token list prices across pay-as-you-go and subscription tiers is misleading — the real cost is the monthly fee, and what you get back depends on your usage pattern. We focus on throughput, reliability, and speed-per-subscription-dollar.
What we measure
- Throughput (tok/s) — output tokens divided by wall-clock time. Network and queueing both count.
- Time-to-first-token (TTFT) — seconds from sending the request to the first streamed token. Captures prefill + queue + network round-trip.
- Generation rate (gen tok/s) — output tokens ÷ (elapsed − TTFT). The model's steady-state decode speed, with the prefill/queue wait excluded.
- Best vs avg tok/s — peak observed and mean across all successful runs.
- Success rate — fraction of runs that completed without HTTP / parsing errors.
- Output length — characters in the longest successful response.
- Speed per $/mo — peak tok/s ÷ cheapest available plan price for each provider.
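The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the `Run` record, its field names, and the `summarize` helper are hypothetical, and failed runs are assumed to carry no timing data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    ok: bool                 # completed without HTTP / parsing errors
    output_tokens: int = 0
    elapsed_s: float = 0.0   # wall-clock seconds for the whole request
    ttft_s: float = 0.0      # seconds until the first streamed token
    output_chars: int = 0


def throughput(run: Run) -> float:
    """Output tokens divided by wall-clock time."""
    return run.output_tokens / run.elapsed_s


def generation_rate(run: Run) -> float:
    """Output tokens divided by (elapsed - TTFT): decode speed only."""
    return run.output_tokens / (run.elapsed_s - run.ttft_s)


def summarize(runs: list[Run], cheapest_plan_usd: float) -> dict[str, float]:
    """Aggregate one (provider, model) row from its individual runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    good = [r for r in runs if r.ok]
    summary = {"success_rate": len(good) / len(runs)}
    if good:
        rates = [throughput(r) for r in good]
        summary.update(
            best_tok_s=max(rates),                      # peak observed
            avg_tok_s=mean(rates),                      # mean over successes
            output_length=max(r.output_chars for r in good),
            speed_per_dollar=max(rates) / cheapest_plan_usd,
        )
    return summary
```

For example, a run with 300 output tokens, 3.0 s elapsed, and a 0.5 s TTFT has a throughput of 100 tok/s but a generation rate of 120 tok/s, because the wait before the first token is excluded from the denominator.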
Why provider matters
The same model is often available on multiple providers. Speed depends on the provider's hardware (Cerebras WSE-3 vs commodity GPUs), routing, and congestion. We always tag each row as provider · model so you can distinguish, say, GLM-4.7 on Cerebras (≈1,000+ tok/s) from GLM-4.7 on z.ai (subscription throughput).
Caveats and disclaimers
- Quality is not measured. A fast wrong answer is still wrong.
- Tail latency is not measured. Averages dominate; p99 numbers would need many more samples per model.
- Suspiciously high tok/s (above ~3,000) on short prompts often indicates cache hits or warmed routes, not sustained agentic performance.
- Subscription "value" uses the cheapest available plan. A $10 plan with 100 prompts/5h beats a $50 plan only if that quota fits your workload.
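Two of these caveats lend themselves to simple checks a reader could apply to the published numbers. The sketch below is illustrative only: the `Plan` type, the function names, and the quota of the $50 plan in the usage example are assumptions, and the 3,000 tok/s threshold is the approximate figure quoted above rather than a hard rule.

```python
from typing import NamedTuple, Optional, Sequence

SUSPICIOUS_TOK_S = 3000.0  # approximate threshold from the caveat above


class Plan(NamedTuple):
    name: str
    usd_per_month: float
    prompts_per_5h: int


def is_suspicious(tok_s: float, threshold: float = SUSPICIOUS_TOK_S) -> bool:
    """Flag rates that may reflect cache hits or warmed routes."""
    return tok_s > threshold


def cheapest_fitting_plan(
    plans: Sequence[Plan], needed_prompts_per_5h: int
) -> Optional[Plan]:
    """Return the cheapest plan whose quota covers the workload, if any."""
    fitting = [p for p in plans if p.prompts_per_5h >= needed_prompts_per_5h]
    return min(fitting, key=lambda p: p.usd_per_month, default=None)


# Hypothetical plans: the larger plan's quota is made up for illustration.
plans = [Plan("small", 10.0, 100), Plan("large", 50.0, 1000)]
```

With these plans, a workload of 80 prompts per 5 hours fits the $10 plan, while 400 prompts per 5 hours only fits the $50 plan, which is why the cheapest-plan figure is only meaningful if the quota matches your usage. Note that the flag looks at the rate alone and does not check prompt length.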
Data sources & attribution
MSA stitches together a few public datasets — credit where credit is due. If you build on top of MSA's output, please attribute these upstream sources alongside us.
- Artificial Analysis — the external coding-benchmark scores (LiveCodeBench, terminal-bench-hard, SWE-bench Verified, scicode, the intelligence and coding indices) shown on the per-model family pages. AA owns the data; MSA just joins it onto our family slugs. Their per-model pages are linked inline next to every score block.
- models.dev (maintained by SST) — catalogue enrichment: knowledge-cutoff dates, release dates, capability flags, open-weights status, and the lifecycle marker behind the leaderboard's deprecated/beta badges. Used when opencode's own catalogue leaves a field blank.
- opencode — the canonical model catalogue (provider list, model IDs, base URLs, pricing, capabilities). MSA delegates to opencode models --verbose so the bench always tracks the user's opencode configuration.
- forkline.dev — part of the throughput data behind the leaderboard. Thanks for the access.
MSA's own numbers (the per-(provider, model) speed measurements produced by the weekly run) are released under the same disclosure rules we apply to everyone covered: published warts and all, reproducible from the open-source code in this repository.