Quality benchmarking (beta)

week 2026-W22 · 0 runs · generated 2026-05-31T20:14:33+00:00 · embeddings off

Alerts

No drift or refusal-spike alerts this week.

Developer ranking

No runs yet for developer.

Architect ranking

No runs yet for architect.

Cross-provider drift

No drift comparisons available. To enable them, register a baseline provider in quality_models.official_provider_id or quality_providers.baseline_for_json.

Score over time

Methodology & caveats

Each model is evaluated against a fixed suite of 35 prompts (20 developer, 15 architect) covering code generation from spec, bug fixing, refactoring, multi-file edits, tool use, architecture decision records, design, trade-off analysis, and problem decomposition. Code prompts are scored by running a hidden pytest suite inside a Docker sandbox. Text prompts are scored by combining regex/contains verifiers with an LLM-as-judge step (Claude Opus 4.7).
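
To make the regex/contains verifier step more concrete, here is a minimal sketch of how such checks could be applied to a text response. Everything in it is an assumption for illustration: the Check type, the run_check and verifier_score helpers, the case-insensitive matching, and the "fraction of checks passed" scoring rule are hypothetical and are not taken from the benchmark's actual implementation. How the verifier result is combined with the LLM judge score is not shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One verifier rule (hypothetical shape, not the benchmark's real schema)."""
    kind: str      # "contains" or "regex"
    pattern: str   # substring or regular expression to look for


def run_check(check: Check, response: str) -> bool:
    """Return True if the response satisfies a single check (case-insensitive)."""
    if check.kind == "contains":
        return check.pattern.lower() in response.lower()
    if check.kind == "regex":
        flags = re.IGNORECASE | re.MULTILINE
        return re.search(check.pattern, response, flags=flags) is not None
    raise ValueError(f"unknown check kind: {check.kind!r}")


def verifier_score(checks: list[Check], response: str) -> float:
    """Fraction of checks passed, in [0.0, 1.0]; assumed scoring rule."""
    if not checks:
        return 0.0
    passed = sum(run_check(c, response) for c in checks)
    return passed / len(checks)


# Example: an architecture-decision prompt that must state a decision
# on its own line and mention a trade-off somewhere in the answer.
adr_checks = [
    Check("regex", r"^Decision:\s+\S+"),
    Check("contains", "trade-off"),
]
```

Calling verifier_score(adr_checks, response) on a model's answer yields a number between 0 and 1. A deterministic score of this kind does not depend on the judge model, which is why it is less exposed to judge bias than the LLM-as-judge step.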

Known judge bias. The LLM judge (Claude Opus) may favour responses that match its own style: more verbose, more structured, or following Anthropic's coding conventions. Text scores are indicative only; pytest sandbox results are the most reliable signal. Treat rankings as relative guidance, not ground truth.