Cerebras OpenAI-compatible

Wafer-scale inference — 1,000 to 2,600 tok/s on open-source models.

⚠️ Heads up: Waitlist only. Code Pro and Code Max have not been reopened to new customers; Cerebras has kept them on an indefinite waitlist since the initial rollout sold out. The PAYG free tier (1M tokens/day) and pay-per-token usage are unaffected.

Cerebras builds wafer-scale processors that run open-source models (Llama, Qwen, GLM, gpt-oss) at sustained 1,000–2,600 tok/s — typically an order of magnitude faster than commodity GPUs. Speed is the whole pitch: same models, vastly higher throughput.

Strengths

  • Highest tok/s on the market for large open-source models
  • Free tier: 1M tokens/day, rate-limited
  • Drop-in OpenAI-compatible endpoint (see the sketch after this list)
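
Because the endpoint speaks the standard Chat Completions protocol, the official `openai` Python SDK works unchanged. A minimal sketch, assuming the `https://api.cerebras.ai/v1` base URL and a `CEREBRAS_API_KEY` environment variable (verify both against your Cerebras dashboard):

```python
import os

from openai import OpenAI

# Point the stock OpenAI client at Cerebras's OpenAI-compatible endpoint.
# Base URL and env var name are assumptions -- confirm them in your dashboard.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # one of the models tested further down this page
    messages=[{"role": "user", "content": "Explain wafer-scale inference in one sentence."}],
)
print(response.choices[0].message.content)
```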

When to use it

  • Interactive chat where latency dominates UX
  • Coding agents that iterate frequently
  • Live demos where speed sells

Subscription plans

| Plan | Price | Quota | Available |
|------|-------|-------|-----------|
| Code Pro | $50/mo | ~24M tokens/day | Closed / sold out |
| Code Max | $200/mo | ~120M tokens/day | Closed / sold out |
Notes: Both Code tiers were sold out at the time of writing. Pay-as-you-go with 1M free tokens/day per account is always on for any catalogue model.
Referral: Cerebras runs a token-bonus referral program: +200K Qwen3-Coder tokens/day for both you and your invitee, capped at 1M extra tokens/day. Generate the link from the in-product flow and replace `signup_url` with your personal one.

Models tested on Cerebras

Speed numbers below are specific to Cerebras's routing and hardware. The same model may appear on other providers' pages with different throughput.

| Model | Best tok/s | Avg tok/s | Runs | Success | Longest output (chars) |
|-------|-----------|-----------|------|---------|------------------------|
| llama3.1-8b | 1500.7 | 1258.4 | 4 | 100% | 2,632 |
| qwen-3-235b-a22b-instruct-2507 | 794.2 | 689.4 | 4 | 75% | 3,487 |
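
If you want a rough reading of throughput on your own account, timing a streamed completion is enough for a sanity check. This is a sketch, not the methodology behind the table above; it reuses the assumed base URL and `CEREBRAS_API_KEY` variable from the earlier snippet, and it approximates token counts by whitespace splitting.

```python
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint, as above
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var name
)

stream = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Write a 300-word overview of CPU caches."}],
    stream=True,
)

pieces, start = [], None
for chunk in stream:
    if not chunk.choices:          # some chunks carry no content delta
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and start is None:
        start = time.monotonic()   # start the clock at the first content chunk
    pieces.append(delta)

assert start is not None, "no content received"
elapsed = time.monotonic() - start
text = "".join(pieces)
# Whitespace words are a crude proxy for tokens; use a tokenizer or the
# response usage stats for real numbers.
print(f"~{len(text.split()) / elapsed:.0f} words/s over {elapsed:.2f}s")
```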