Cerebras OpenAI-compatible

Wafer-scale inference — 1,000 to 2,600 tok/s on open-source models.

⚠️ Heads up: Waitlist only. Code Pro and Code Max have not been reopened to new customers; Cerebras has kept them on an indefinite waitlist since the initial rollout sold out. The PAYG free tier (1M tokens/day) and pay-per-token usage are unaffected.

Cerebras builds wafer-scale processors that run open-source models (Llama, Qwen, GLM, gpt-oss) at sustained 1,000–2,600 tok/s — typically an order of magnitude faster than commodity GPUs. Speed is the whole pitch: same models, vastly higher throughput.

Strengths

  • Highest tok/s on the market for large open-source models
  • Free tier: 1M tokens/day, rate-limited
  • Drop-in OpenAI-compatible endpoint (see the sketch after this list)
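
Because the endpoint speaks the standard Chat Completions protocol, the official `openai` Python SDK works unchanged. A minimal sketch, assuming the `https://api.cerebras.ai/v1` base URL and a `CEREBRAS_API_KEY` environment variable (verify both against your Cerebras dashboard):

```python
import os

from openai import OpenAI

# Point the stock OpenAI client at Cerebras's OpenAI-compatible endpoint.
# Base URL and env var name are assumptions -- confirm them in your dashboard.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # one of the models tested further down this page
    messages=[{"role": "user", "content": "Explain wafer-scale inference in one sentence."}],
)
print(response.choices[0].message.content)
```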

When to use it

  • Interactive chat where latency dominates UX
  • Coding agents that iterate frequently
  • Live demos where speed sells

Subscription plans

| Plan | Price | Quota | Available |
|------|-------|-------|-----------|
| Code Pro | $50/mo | ~24M tokens/day | Closed / sold out |
| Code Max | $200/mo | ~120M tokens/day | Closed / sold out |
Notes: Both Code tiers were sold out at the time of writing. Pay-as-you-go with 1M free tokens/day per account is always on for any catalogue model.
Referral: Cerebras runs a token-bonus referral program: +200K Qwen3-Coder tokens/day for both you and your invitee, capped at 1M extra tokens/day. Generate the link from the in-product flow and replace `signup_url` with your personal one.

Models tested on Cerebras

Speed numbers below are specific to Cerebras's routing and hardware. The same model may appear on other providers' pages with different throughput.

| Model | Best tok/s | Avg tok/s | Runs | Success | Longest output (chars) |
|-------|-----------|-----------|------|---------|------------------------|
| llama3.1-8b | 1500.7 | 1258.4 | 4 | 100% | 2,632 |
| qwen-3-235b-a22b-instruct-2507 | 794.2 | 689.4 | 4 | 75% | 3,487 |
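
If you want a rough reading of throughput on your own account, timing a streamed completion is enough for a sanity check. This is a sketch, not the methodology behind the table above; it reuses the assumed base URL and `CEREBRAS_API_KEY` variable from the earlier snippet, and it approximates token counts by whitespace splitting.

```python
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint, as above
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var name
)

stream = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Write a 300-word overview of CPU caches."}],
    stream=True,
)

pieces, start = [], None
for chunk in stream:
    if not chunk.choices:          # some chunks carry no content delta
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and start is None:
        start = time.monotonic()   # start the clock at the first content chunk
    pieces.append(delta)

assert start is not None, "no content received"
elapsed = time.monotonic() - start
text = "".join(pieces)
# Whitespace words are a crude proxy for tokens; use a tokenizer or the
# response usage stats for real numbers.
print(f"~{len(text.split()) / elapsed:.0f} words/s over {elapsed:.2f}s")
```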