About this leaderboard

How the numbers are produced and what they mean.

Methodology

Every benchmark session sends the same prompt to every model a provider exposes and records latency and output tokens per second. The site refreshes weekly.

Why we don't compare list prices

Most providers covered here sell flat-rate coding subscriptions (Claude Code, ChatGPT/Codex, GLM Coding Plan, MiniMax Token Plan, Alibaba Coding Plan). Comparing per-token list prices across pay-as-you-go and subscription tiers is misleading — the real cost is the monthly fee, and what you get back depends on your usage pattern. We focus on throughput, reliability, and speed-per-subscription-dollar.

What we measure

Throughput (tok/s) — output tokens divided by wall-clock time. Network and queueing both count.
Best vs avg tok/s — peak observed and mean across all successful runs.
Success rate — fraction of runs that completed without HTTP / parsing errors.
Output length — chars in the longest successful response.
Speed per $/mo — peak tok/s ÷ cheapest available plan price for each provider.

Why provider matters

The same model is often available on multiple providers. Speed depends on the provider's hardware (Cerebras WSE-3 vs commodity GPUs), routing, and congestion. We always tag each row as provider · model so you can tell apart, say, GLM-4.7 on Cerebras (≈1,000+ tok/s) from GLM-4.7 on z.ai (subscription throughput).

Caveats and disclaimers

Quality is not measured. A fast wrong answer is still wrong.
Tail latency is not measured. Averages dominate; p99 numbers would need many more samples per model.
Time-to-first-token is not measured. We capture end-to-end latency only — streaming UX may feel different.
Suspiciously high tok/s (above ~3,000) on short prompts often indicates cache hits or warmed routes, not sustained agentic performance.
Subscription "value" uses the cheapest available plan. A $10 plan with 100 prompts/5h beats a $50 plan only if that quota fits your workload.

Acknowledgements

Part of the data behind this leaderboard is provided by forkline.dev — thanks for the access.