Cerebras OpenAI-compatible
Wafer-scale inference — 1,000 to 2,600 tok/s on open-source models.
⚠️ Heads up: Waitlist only. Code Pro and Code Max have not been re-opened to new customers — Cerebras has kept them on an indefinite waitlist since the initial rollout sold out. The PAYG free tier (1M tokens/day) and pay-per-token usage are unaffected.
Cerebras builds wafer-scale processors that run open-source models (Llama, Qwen, GLM, gpt-oss) at sustained 1,000–2,600 tok/s — typically an order of magnitude faster than commodity GPUs. Speed is the whole pitch: same models, vastly higher throughput.
Strengths
- Highest tok/s on the market for large open-source models
- Free tier: 1M tokens/day (rate-limited)
- Drop-in OpenAI-compatible endpoint
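Because the endpoint speaks the OpenAI chat-completions wire format, any OpenAI-style client works by swapping the base URL. A minimal sketch with only the standard library — the base URL `https://api.cerebras.ai/v1` and the `CEREBRAS_API_KEY` env var name are assumptions; check your dashboard for the exact values:

```python
# Minimal sketch of calling Cerebras's OpenAI-compatible endpoint.
# Base URL, env var name, and model name are assumptions -- verify in your dashboard.
import json
import os
import urllib.request

CEREBRAS_BASE_URL = "https://api.cerebras.ai/v1"  # assumed endpoint

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request against Cerebras."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{CEREBRAS_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('CEREBRAS_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("llama3.1-8b", "Say hello in five words.")
# resp = urllib.request.urlopen(req)  # uncomment once CEREBRAS_API_KEY is set
```

The same swap works for the official `openai` Python SDK by passing `base_url` to the client constructor.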
When to use it
- Interactive chat where latency dominates UX
- Coding agents that iterate frequently
- Live demos where speed sells
Subscription plans
| Plan | Price | Quota | Available |
|---|---|---|---|
| Code Pro | $50/mo | ~24M tokens/day | closed / sold out |
| Code Max | $200/mo | ~120M tokens/day | closed / sold out |
Notes: Both Code tiers were sold out at the time of writing. Pay-as-you-go, with 1M free tokens/day per account, remains available for every catalogue model.
Referral: Cerebras runs a token-bonus referral program: +200K Qwen3-Coder tokens/day for both you and your invitee, capped at 1M extra tokens/day. Generate the link from the in-product flow and replace `signup_url` with your personal one.
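The quota arithmetic above can be sketched as follows — assuming (not confirmed by the source) that the +200K/day bonus stacks once per referral until the 1M/day cap:

```python
# Hedged sketch of the effective free daily quota under the referral program.
# Assumption: the bonus stacks per successful referral, capped at 1M extra/day.
BASE_FREE_TOKENS = 1_000_000   # PAYG free tier, tokens/day
BONUS_PER_REFERRAL = 200_000   # Qwen3-Coder tokens/day per referral
BONUS_CAP = 1_000_000          # maximum extra tokens/day

def daily_quota(referrals: int) -> int:
    """Free tokens/day after `referrals` successful sign-ups."""
    bonus = min(referrals * BONUS_PER_REFERRAL, BONUS_CAP)
    return BASE_FREE_TOKENS + bonus

# daily_quota(0) -> 1_000_000; daily_quota(5) -> 2_000_000 (cap reached)
```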
Models tested on Cerebras
Speed numbers below are specific to Cerebras's routing and hardware. The same model may appear on other providers' pages with different throughput.
| Model | Best tok/s | Avg tok/s | Runs | Success | Longest output (chars) |
|---|---|---|---|---|---|
| llama3.1-8b | 1500.7 | 1258.4 | 4 | 100% | 2,632 |
| qwen-3-235b-a22b-instruct-2507 | 794.2 | 689.4 | 4 | 75% | 3,487 |
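To put the table's rates in wall-clock terms, a quick back-of-envelope conversion from tok/s to generation time — the 100 tok/s GPU baseline is an illustrative assumption, not a measured figure:

```python
def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to stream `tokens` at a sustained rate."""
    return tokens / tok_per_s

# A 10K-token response at llama3.1-8b's average rate from the table,
# versus an assumed 100 tok/s commodity-GPU baseline:
fast = generation_seconds(10_000, 1258.4)  # roughly 8 seconds
slow = generation_seconds(10_000, 100.0)   # 100 seconds
```

That gap is the "order of magnitude" claim made concrete: the same response that streams in under ten seconds here takes well over a minute at typical GPU rates.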