CrofAI Hobby
Aggregator that resells open-source models behind one cheap key. Cheapest available plan at $5/mo (Hobby); peak 614.4 tok/s on kimi-k2.5-lightning.
Real-world throughput for every model on every coding-agent subscription we could get our hands on. Same prompt, same client, every model the provider exposes. Updated weekly.
The fastest available subscription tier in each price bracket — based on the peak tok/s we measured on each provider's plan. Buckets with no available subscription are skipped.
Coverage paused — Synthetic subscription not renewed for 2026-W22; historical data retained. Cheapest available plan at $30/mo (Subscription); peak 224.2 tok/s on hf:moonshotai/Kimi-K2.6.
mimo-v2.5 via Xiaomi direct (Token Plan Europe) for problems that need long reasoning: 0.919 quality, 75 tok/s, 1M context, and a price tag that just got slashed (see notes). MiniMax-M2.5 via the Lightning route for heavy workloads where 100+ tok/s and an Anthropic-compatible wire format make Claude Code integration trivial. Both are in the quality top 10 and both bill flat-rate, so there is no runaway cost on long agentic sessions. If budget is not the constraint: claude-opus-4-8 is beyond debate this week at #1 overall (0.973), #2 architect, tied #1 developer. But at $15/$75 per million tokens it's a one-shot tool for critical work, not something to iterate against.
:free tier or MiniMax under load; baidu/cobuddy:free was the surprise at #1 in architect (0.989).
A loaded week of movements on the board.
claude-opus-4-8 ships and dominates. Anthropic released claude-opus-4-8 this week and it sits at or next to the top of all three of our rankings: #1 overall with an aggregate score of 0.973 (vs 0.948 for opus-4-7), tied #1 developer at 0.969, #2 architect at 0.976. At $15/$75 per million tokens it remains a model for occasional use, not iteration, but the quality jump over 4-7 justifies the price in cases where the choice between the two used to be blurry.
poolside/laguna-xs.2:free is the surprise of the year, not just the week. Poolside's model (lab founded by Eiso Kant and Jason Warner) ties opus-4-7 in the overall top (0.948) and ties opus-4-8 itself in developer with 0.969. What makes it genuinely disruptive: it is open-weight under Apache 2.0 on Hugging Face (huggingface.co/poolside/Laguna-XS.2) and Poolside claims it runs on a single GPU — 33B total / 3B active (MoE), 128K context, SWA with per-head gating, fp8. A frontier-class coding model you can self-host on reasonable hardware. The :free route on OpenRouter is a launch promo, not permanent — use it while it lasts. It is the first time MSA has seen an open-weight model compete technically with proprietary frontier in its own league.
Xiaomi in, Synthetic out. Xiaomi MiMo debuts via its Token Plan Europe (4 SKUs benched: omni, pro, v2.5, v2.5-pro; the TTS / voice-clone variants are filtered out at the provider level). Solid results: mimo-v2-pro ties #1 in developer (0.969), mimo-v2.5 sits #4 in architect (0.939). On top of that, Xiaomi announced (2026-05-28, effective 2026-05-27) a permanent price cut across the V2.5 family: V2.5-Pro input drops from $1.00 to $0.435 per million, output from $3.00 to $0.87, cached input from ~$0.36 to $0.0036 — the ~99% headline cut. Context-length pricing tiers are gone. Token Plan balances were reset with credit equivalence 5-8× higher. At this price Xiaomi MiMo lands in DeepSeek territory and becomes a natural pick for long-reasoning workloads. Source: platform.xiaomimimo.com. Synthetic, meanwhile, leaves the board: we did not renew the subscription this cycle — the 33 catalogue SKUs all returned HTTP 402 "balance $0.00". The provider stays on the site with its history and a pause caveat; if anyone wants to sponsor an API key so we can keep covering it, we'll wire it back in (see the provider's card).
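The size of each cut can be checked directly from the per-million prices quoted above. A minimal sketch in Python; the dictionary keys and the `pct_cut` helper are illustrative names, and the old cached-input price is the approximate ~$0.36 figure from the announcement.

```python
# V2.5-Pro prices in USD per million tokens: (old, new), as quoted above.
prices = {
    "input": (1.00, 0.435),
    "output": (3.00, 0.87),
    "cached_input": (0.36, 0.0036),  # old price is approximate ("~$0.36")
}


def pct_cut(old: float, new: float) -> float:
    """Return the price reduction as a percentage of the old price."""
    return (old - new) / old * 100


cuts = {name: round(pct_cut(old, new), 1) for name, (old, new) in prices.items()}
for name, cut in cuts.items():
    print(f"{name}: -{cut}%")
```

Only the cached-input line reaches the ~99% headline figure; the uncached input and output cuts are smaller.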
CrofAI is reworking subscriptions — likely off the bench soon. CrofAI posted (2026-05-31) that flat-rate subs are no longer sustainable. Two options are on the table: convert subs to credit-based plans worth 1.15× their PAYG value (a $5 Hobby tier becomes $5.75 of credit per month), or remove subscriptions entirely and push PAYG prices as low as possible. The change starts 2026-05-31 and should be done by 2026-06-01 to 06-03; currently purchased subs keep working until they end or renew before anything switches. Either way the flat, unlimited $5 Hobby plan that made CrofAI our volume daily driver disappears — under metered PAYG, running the full weekly suite against CrofAI stops being free. We've already removed CrofAI from this week's pick, and expect to drop it from the weekly bench once the new pay-per-use pricing goes live (likely 2026-W23). The provider card and historical data stay on the site; if PAYG turns out cheap enough to keep covering it, we'll revisit.
Alibaba: goodbye starting next week. The Model Studio Token Plan FAQ states verbatim: "This plan is for interactive use within compatible AI coding and agent tools only. It cannot be used for automated scripts or application backends." The stated penalty is "subscription suspension or API Key blocking." The legacy Coding Plan carries an equivalent clause ("programming tools only"). Our MSA bench is exactly what the clause forbids: an automated script, not a human-supervised interactive agent. The clause is not new (it has been in place for months), but after rereading it with a cool head this week we decided to honour it. W22 is Alibaba's last appearance on the board. The historical data stays for context; the provider card gets a "Retired (ToS)" badge. Thanks to forkline.dev for supplying the Alibaba API key that let us include them in the weekly bench up to this point. Bye bye Ali, hi Xiaomi.
Catalogue: what's in, what's out. Cerebras retired qwen-3-235b-a22b-instruct-2507 (HTTP 404 across the run). opencode-go rotated: +qwen3.7-max, -qwen3.5-plus. OpenRouter lost baidu/cobuddy:free — the W20 surprise pick — and the entire baidu/cobuddy family is gone from the catalogue, not just the free SKU (Baidu still ships Ernie 4.5 there). It also lost arcee-ai/trinity-large-thinking:free; the paid arcee-ai/trinity-large-thinking variant is still listed, so if you want that model the route exists, just not at $0. New: moonshotai/kimi-k2.6:free. OpenRouter stays filtered to :free SKUs only on our side — the paid routes duplicate providers we already bench directly.
Drift by (provider, model): the inference provider matters as much as the model. Same model, different provider, very different score. The findings: glm-5 on CrofAI scored 0.07 vs 0.941 on Alibaba and 0.908 on opencode-go — CrofAI served the broken variant (HTTP 200 + empty body on the precision endpoints). deepseek-v4-flash on opencode-go scored 0.547 vs 0.945 on CrofAI and 0.913 on DeepSeek direct — opencode-go consistently penalises DeepSeek. MiniMax-M2.5 on Alibaba: 0.919 vs 0.548 on MiniMax direct — this week Alibaba served M2.5 better than MiniMax themselves, inverting the usual reading. mimo-v2.5 on Xiaomi direct 0.919 vs opencode-go 0.725 — Xiaomi direct wins by 19pp. The takeaway: for popular open-source weights, who runs the inference matters as much as which model you pick. Stable across providers (≤8pp spread): glm-4.7, glm-5.1, mimo-v2.5-pro, qwen3.6-plus.
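The spreads behind those findings can be recomputed from the scores quoted above. This is a sketch only: the `scores` layout, the `spread_pp` helper and the use of max minus min as the spread are assumptions for illustration, not the bench's actual code. The 8pp threshold is the "stable" cut-off mentioned in the text.

```python
# Quality scores per (model, provider), copied from the paragraph above.
scores = {
    "glm-5": {"CrofAI": 0.07, "Alibaba": 0.941, "opencode-go": 0.908},
    "deepseek-v4-flash": {"opencode-go": 0.547, "CrofAI": 0.945, "DeepSeek direct": 0.913},
    "MiniMax-M2.5": {"Alibaba": 0.919, "MiniMax direct": 0.548},
    "mimo-v2.5": {"Xiaomi direct": 0.919, "opencode-go": 0.725},
}

STABLE_PP = 8.0  # "stable across providers" threshold, in percentage points


def spread_pp(by_provider: dict) -> float:
    """Gap between the best and worst provider, in percentage points."""
    values = by_provider.values()
    return (max(values) - min(values)) * 100


report = {}
for model, by_provider in scores.items():
    best = max(by_provider, key=by_provider.get)
    worst = min(by_provider, key=by_provider.get)
    spread = round(spread_pp(by_provider), 1)
    report[model] = (spread, best, worst)
    label = "stable" if spread <= STABLE_PP else "drifting"
    print(f"{model}: {spread}pp ({best} best, {worst} worst) -> {label}")
```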
Raw speed. Cerebras holds the crown with zai-glm-4.7 at 521 tok/s end-to-end and 1,490 tok/s generation rate (TTFT 2.6s) — unbeatable on latency. The new nvidia/nemotron-3-nano family on OpenRouter :free pulls 200+ tok/s with sub-1s TTFT — a cheap fallback lane for non-critical work. CrofAI Lightning (kimi-k2.5-lightning) generates at 503 tok/s once it gets going, but the 10s TTFT kills the end-to-end number.
Discovered this week: MiniMax's official CLI mmx (github.com/MiniMax-AI/cli, MIT). Install: npm install -g mmx-cli. It is multimodal — text, image, video, speech, music, vision, web search — with non-interactive flags and JSON-schema export so agents (Cursor, Claude Code, opencode) can register mmx as a tool. Requires a MiniMax Token Plan (no PAYG path per the README). If you have the subscription it is worth trying — the opencode integration is direct.
Want a provider covered? Sponsor it. Synthetic is on hold this week for exactly that reason — an API key with modest quota, single-digit million tokens per month, is enough to bring it back. millaguie [at] gmail [dot] com.
A methodology change and a few board moves this week.
We now measure time-to-first-token. Every speed benchmark streams and separates TTFT from total time; we report a generation rate (tok/s excluding the prefill/queue wait) alongside the end-to-end figure.
The DeepSeek 5,000+ tok/s mystery is solved — it was a measurement artifact. The old single-elapsed timing folded queue and prefill into the rate. With TTFT separated, DeepSeek lands at ~60-75 tok/s end-to-end. That closes the “more next week” note from the index.
What TTFT exposes. End-to-end tok/s was hiding large prefill waits: CrofAI's kimi-k2.5-lightning generates at ~6,500 tok/s once it starts but carries an 11.7s TTFT; the Synthetic and opencode-go GLMs lose 7-10s to prefill. Cerebras is the genuine speed leader — llama3.1-8b at 1,479 tok/s and gpt-oss-120b at 1,372, both with 0.2-0.3s TTFT.
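To make the distinction concrete, here is a minimal sketch of how the two rates can be separated from a streamed response. The names `SpeedResult` and `measure` are illustrative, not the bench's code; it assumes one token per streamed chunk (a real client would more likely use the provider's reported token count) and takes an injectable clock so the example runs without a network call.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SpeedResult:
    tokens: int
    ttft_s: float   # time from request start to first streamed chunk
    total_s: float  # time from request start to end of stream

    @property
    def end_to_end_tps(self) -> float:
        """Tokens over the whole wall-clock time, prefill/queue wait included."""
        return self.tokens / self.total_s

    @property
    def generation_tps(self) -> float:
        """Tokens over the time after the first chunk arrived."""
        gen_time = self.total_s - self.ttft_s
        return self.tokens / gen_time if gen_time > 0 else float("nan")


def measure(stream: Iterable[str],
            clock: Callable[[], float] = time.perf_counter) -> SpeedResult:
    """Consume a token stream, timestamping the start, first chunk and end."""
    start = clock()
    first = None
    tokens = 0
    for _chunk in stream:
        if first is None:
            first = clock()
        tokens += 1  # assumption: one token per chunk
    end = clock()
    if first is None:
        raise ValueError("stream produced no tokens")
    return SpeedResult(tokens=tokens, ttft_s=first - start, total_s=end - start)


# Simulated run: 1000 tokens, first chunk after 10s, stream done at 12s.
ticks = iter([0.0, 10.0, 12.0])
result = measure(["tok"] * 1000, clock=lambda: next(ticks))
print(f"TTFT {result.ttft_s:.1f}s, "
      f"end-to-end {result.end_to_end_tps:.1f} tok/s, "
      f"generation {result.generation_tps:.1f} tok/s")
```

With a long wait before the first chunk, the same stream looks slow end-to-end and fast on generation rate, which is the pattern described for the Lightning route.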
CrofAI is off the board — a serving outage. For the last few hours CrofAI has been returning HTTP 200 with empty bodies across much of its catalogue (the -precision/-pro/-flash variants and the newer, bigger models: deepseek-v4-pro-precision 71% empty, gemma-4-31b-it 55%, glm-5.1-precision 45%). It also hit the model we use as our cost-optimized judge, so this week we re-judged the architecture suite against our reference judge, claude-opus-4-7. CrofAI comes back when serving stabilises — and yes, that moves it out of the pick too (it was last week's pick; when something breaks, we say so).
OpenAI is retired from the board — our call. We brought ChatGPT in for a few weeks to evaluate it and aren't convinced it earns a permanent column, so we're dropping it (the larger gpt-5.x and -codex models also need a Pro subscription we no longer keep). Want a provider covered? Sponsor it. Lend us an API key with modest quota — single-digit millions of tokens a month is enough — and we'll bench it under the same disclosure rules as everyone else. API-credit donations welcome: millaguie [at] gmail [dot] com.
Also out: Synthetic finished its retirement pass (DeepSeek V3.x, Llama 3/4, Qwen2.5-Coder now return 404); alibaba/qwen3.7-max didn't respond (errors across the run); Copilot stays free-tier only.
Quality (architect, judged by opus-4-7). Top of the board: claude-sonnet-4-5 (0.976) and claude-opus-4-7 (0.974), then kimi-k2.5 on Alibaba (0.967), Kimi-K2.6 on Synthetic (0.966) and claude-sonnet-4-6 (0.966). Note: switching the judge from kimi-k2.6-precision to opus-4-7 recalibrates absolute scores, so this week's architect numbers are not directly comparable to the W18-W20 trend line.
Five things landed this week.
opencode-go debuts, and falls 40% behind on its flagship. The opencode team shipped their own pay-as-you-go gateway (opencode.ai/zen/go/v1), a curated open-source catalogue of twelve models (DeepSeek V4, GLM 5.x, Kimi K2.5/K2.6, MiMo V2.5, MiniMax M2.5/M2.7, Qwen 3.5/3.6). It is convenient, with one key for what opencode itself uses, but on quality benches it trails the direct providers it routes to. Our drift detector flagged deepseek-v4-pro on opencode-go at -40% vs DeepSeek direct (critical, z=-2.20), kimi-k2.6 at -20% vs CrofAI, and minimax-m2.7 at -9% vs MiniMax direct. The lighter models fare better (minimax-m2.5 placed #7 in the combined ranking), but the routing penalty hits the heavy reasoners disproportionately. Useful as a convenience layer; not a replacement for the upstreams.
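The detector's internals are not described here, so the following is only a sketch of one common way to produce a relative delta and a z-score for a suspect measurement. Every number in it is hypothetical, as are the helper names and the |z| >= 2 cut-off used for "critical".

```python
from statistics import mean, stdev
from typing import Sequence

CRITICAL_Z = 2.0  # hypothetical threshold, not taken from the bench


def relative_delta(score: float, reference: float) -> float:
    """Relative change of score vs reference, e.g. -0.40 for a 40% drop."""
    return (score - reference) / reference


def z_score(score: float, baseline: Sequence[float]) -> float:
    """Standard score against a baseline sample (needs 2+ points, nonzero spread)."""
    return (score - mean(baseline)) / stdev(baseline)


# Hypothetical data: three earlier scores for a model on its direct provider,
# and one new score for the same model served through a gateway.
baseline = [0.90, 0.92, 0.94]
gateway_score = 0.876

delta = relative_delta(gateway_score, mean(baseline))
z = z_score(gateway_score, baseline)
severity = "critical" if abs(z) >= CRITICAL_Z else "ok"
print(f"delta={delta:+.1%} z={z:+.2f} severity={severity}")
```

The two numbers answer different questions: the delta says how large the drop is, while the z-score says how unusual it is given the spread of the baseline.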
OpenAI is off the bench this week. Our ChatGPT Pro subscription — the one we measured the gpt-5.x family against — lapsed before W20 started, and several of the larger models (gpt-5.5, gpt-5.5-fast, all -codex variants) require Pro to access. Rather than publish a half-measured OpenAI column, we pulled it entirely for the week. It comes back when we either renew or switch to a different auth flow that survives the Pro lapse. Historical W18-W19 OpenAI data on the dashboard stays.
GitHub Copilot, also off — but for editorial reasons. Microsoft's recent platform direction has made it less interesting to bench as a coding-agent option (model availability per plan, rate-limit changes, and the new opt-outs all moving in directions that erode what made Copilot worth the column). The auth is still wired and we keep the historical numbers, but we're not running Copilot in the weekly bench until that picture stabilises.
Catalogue churn. CrofAI added four models this week (deepseek-v4-flash, deepseek-v4-pro-precision, mimo-v2.5-pro, mimo-v2.5-pro-precision) and retired qwen3.5-9b-chat. The Xiaomi MiMo line lands well: mimo-v2.5-pro-precision sits #11 in the architect rankings at 0.972, holding its own against Qwen and GLM at similar tok/s. Synthetic, meanwhile, ran a big retirement pass — fourteen models now return HTTP 404 "no longer supported", including the Llama 3.x/4 family, DeepSeek V3 through V3.2, Qwen2.5-Coder, MiniMax-M2/M2.1, the Kimi-K2-Instruct/Thinking pair and GLM-4.6. Their catalogue is consolidating around current SKUs only.
baidu/cobuddy:free is the genuine surprise. Topped the architect rankings outright (0.989), placed mid-table in developer (0.884), and clocked a 499 tok/s median throughput — all with no per-token cost on OpenRouter's free tier. We've eyeballed a handful of its responses and they hold up; this isn't the usual :free tradeoff of "works until it doesn't." If you need a second lane behind your paid daily driver, this is currently the strongest free option on the board.
How opencode-go and Alibaba Cloud got onto the bench. Both providers are on the leaderboard because somebody at — or with — the provider sent us an API key with enough quota to run the standard weekly suite. We will do the same for any provider we're not currently covering. The deal stays the same as for everyone else already on the board: we wire the auth, we publish the numbers (the unflattering ones included), and we annotate the methodology so readers can reproduce. If you can spot us a key — single-digit millions of tokens per month is enough — write to millaguie [at] gmail [dot] com or open an issue on the repo.
Heads-up on Alibaba Cloud. Starting next month, Alibaba's updated Terms of Service will rule out the kind of access this benchmark relies on. We expect to lose the Alibaba column from the next monthly cycle onward — keys revoked, numbers frozen. The historical W18-W20 Alibaba data on the dashboard stays so the trend lines remain intact, but the weekly leaderboard will no longer include them once the new ToS takes effect. If Alibaba shipped a compliant path for third-party benches before then, we'd switch to it; right now there isn't one.
Three things landed this week.
OpenRouter as a new front door. 187 models in their catalogue, 25 of them :free text models. With $10+ in credit you cross into the higher daily quota tier (~1000 free-model requests/day), making them viable for both speed and quality benches. Two surprises in the speed pass: nvidia/nemotron-3-nano-30b-a3b:free clocked 4177 tok/s and poolside/laguna-xs.2:free hit 3746 tok/s — both in Cerebras territory, unusual for OpenRouter-routed inference.
Catalogue refresh. Most providers shipped new SKUs since W18: Claude added Sonnet 4.5; Copilot opened up the gpt-5.x family plus grok-code-fast-1; Alibaba's Qwen3.6 lineup is in; Cerebras now serves zai-glm-4.7; CrofAI gained five Kimi and qwen3.5 variants; OpenAI's gpt-5.5 family and several gpt-5.4 variants are tested for the first time.
Alibaba opened a second door — the Token Plan (Team Edition). After months of the legacy Coding Plan being effectively unobtainable, Alibaba shipped a parallel Token Plan that you can actually buy. Three tiers per seat: Standard $30/mo (25k Credits), Pro $100/mo (100k Credits, their recommended tier), Max $200/mo (250k Credits). Floor is cheaper than the legacy $50 Pro, but the closest equivalent is now $100 — a 2× hike at the volume most users actually want.
Credits vs requests: what really changed. The legacy Coding Plan billed by request count (one HTTP call to the endpoint). A single user query in an agent like opencode or Claude Code typically expands into 5-30 requests internally (planning, tool calls, re-prompts), so 90k requests/mo at $50 actually meant a few thousand user-visible queries. The Token Plan bills by Credits derived from input + cached + output tokens, modulated by model, thinking mode and tool calls. Alibaba doesn't publish the credit-to-token ratio, so budgeting precisely is impossible without measuring your own workload first. Practical effect: request-based billing subsidised thinking-mode and long-output models (they cost the same as a one-shot reply); credit-based billing makes you pay for those tokens proportionally. Agents that fire many short, tool-heavy requests are better off on Credits; long-context, deep-reasoning, terse-output workloads were better off on the old request quota, which you can't buy anymore.
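The request-to-query arithmetic for the legacy plan works out as follows. A small sketch using only the figures quoted above (90k requests, $50, 5-30 requests per query); the function and variable names are illustrative.

```python
REQUESTS_PER_MONTH = 90_000
PLAN_PRICE_USD = 50.0
FANOUT_MIN, FANOUT_MAX = 5, 30  # internal requests per user-visible query


def user_queries(requests: int, requests_per_query: int) -> int:
    """How many user-visible queries a request quota covers."""
    return requests // requests_per_query


worst_case = user_queries(REQUESTS_PER_MONTH, FANOUT_MAX)  # heavy agent loops
best_case = user_queries(REQUESTS_PER_MONTH, FANOUT_MIN)   # light agent loops

print(f"queries/month: {worst_case:,} to {best_case:,}")
print(f"cost per query: ${PLAN_PRICE_USD / best_case:.4f} "
      f"to ${PLAN_PRICE_USD / worst_case:.4f}")
```

The same calculation is not possible for the Token Plan without measuring a workload, since the credit-to-token ratio is unpublished.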
Catch: the Token Plan ToS restrict it to interactive use with compatible AI coding/agent tools only. What the FAQ explicitly forbids: automated scripts, application backends, benchmarking and research scripts — violations may trigger API key revocation. What stays allowed: long agentic sessions inside a compatible tool (Claude Code, opencode, Qwen Code, etc.) with a human in the loop, even if the agent itself runs autonomously for hours. The line is "is there a person driving a compatible tool?", not session duration. Our weekly bench against Alibaba runs against a legacy Coding Plan subscription where this clause didn't exist; new subscribers on the Token Plan should keep usage inside one of those compatible agents.
A note on :free quotas. OpenRouter's free tier resets daily at midnight UTC, but the per-model bottleneck is usually the upstream provider, not OpenRouter itself. Several Meta-llama and Mistral-derived variants returned HTTP 429s all week and never recovered across 11 retry attempts — those are upstream pool limits, not your account's. Cross-provider drift comparisons (openai/gpt-oss-120b OpenRouter vs Synthetic; z-ai/glm-4.5-air OpenRouter vs zai) are now possible and arguably the real new value here.
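The bench's actual retry policy is not described beyond the 11 attempts, so this is only a sketch of a typical capped exponential backoff on HTTP 429. The `send` callable, the delay values and the helper name are assumptions; the sender and the sleep function are injected so the example runs without touching a network.

```python
import time
from typing import Callable, List, Tuple


def call_with_retries(send: Callable[[], int],
                      max_attempts: int = 11,
                      base_delay_s: float = 1.0,
                      max_delay_s: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, int]:
    """Call send() until it returns something other than 429.

    Returns (last_status, attempts_used). Delays double after each 429 and
    are capped at max_delay_s; there is no sleep after the final attempt.
    """
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != 429:
            return status, attempt
        if attempt < max_attempts:
            sleep(min(max_delay_s, base_delay_s * 2 ** (attempt - 1)))
    return status, max_attempts


# Simulated upstream pool that never recovers: every call returns 429.
delays: List[float] = []
status, attempts = call_with_retries(lambda: 429, sleep=delays.append)
print(status, attempts, delays)
```

A model that still returns 429 after the whole schedule is the "never recovered" case described above, and waiting longer on the client side does not help when the limit sits in the upstream pool.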
Peak tok/s ÷ cheapest available plan price. Pay-as-you-go-only providers are excluded — speed-per-subscription-dollar is undefined for them.
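As a sketch of the ratio, here it is applied to the two provider cards quoted earlier on this page (CrofAI Hobby at $5/mo with a 614.4 tok/s peak, Synthetic at $30/mo with 224.2 tok/s). The last two rows are hypothetical placeholders that show the exclusions, and the function name is illustrative.

```python
from typing import List, Optional, Tuple


def speed_per_dollar(peak_tps: float,
                     monthly_price_usd: Optional[float]) -> Optional[float]:
    """Peak tok/s per subscription dollar; None when the ratio is undefined."""
    if monthly_price_usd is None or monthly_price_usd <= 0:
        return None  # pay-as-you-go only (no plan) or a $0 plan
    return peak_tps / monthly_price_usd


# (plan, peak tok/s, cheapest available plan price in USD/month)
plans = [
    ("CrofAI Hobby", 614.4, 5.0),
    ("Synthetic Subscription", 224.2, 30.0),
    ("payg-only example", 70.0, None),  # hypothetical row: no subscription
    ("free-tier example", 200.0, 0.0),  # hypothetical row: $0 plan
]

ranked: List[Tuple[str, float]] = []
for name, tps, price in plans:
    value = speed_per_dollar(tps, price)
    if value is not None:
        ranked.append((name, value))
ranked.sort(key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name}: {value:.2f} tok/s per $")
```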
Models that cost $0 per use — either because the provider's cheapest available plan is free (Copilot Free, OpenRouter free tier) or because the SKU itself is marked free in the catalogue. Listed separately because speed-per-$ is undefined at $0 and would skew the value ranking above. Quotas and rate-limits are tight — useful as a fallback lane, not as a primary subscription.
… and 53 more free-tier models measured.
Output tokens per wall-clock second.
A note on the DeepSeek numbers: we're still investigating whether they're real. We've run more tests than the ones published here and they all land in the same range, so it's not a one-off measurement glitch. More info next week. Bars use a log scale so the smaller subscription numbers stay legible alongside Cerebras' genuinely-sustained 1,000–2,600 tok/s.
Best output tok/s recorded per model and provider, per measurement date. Only models measured on 2+ dates are shown.
Every coding subscription tier we know about, with measured peak speed and current availability. The cheapest available plan on each provider is highlighted.
millaguie [at] gmail [dot] com.
No subscription tier: pay per token. Fast for sporadic use, harder to budget for daily agent loops.
Chinese AI lab known for open-weight MoE models that punch above their price.