Quality benchmarking beta
Alerts
No drift or refusal-spike alerts this week.
Developer ranking
No runs yet for developer.
Architect ranking
No runs yet for architect.
Cross-provider drift
No drift comparisons available. To enable them, register a baseline provider in quality_models.official_provider_id or quality_providers.baseline_for_json.
Score over time
Methodology & caveats
Each model is evaluated against a fixed suite of 35 prompts (20 developer, 15 architect) covering code generation from spec, bug fixing, refactoring, multi-file edits, tool use, architecture decision records, design, trade-off analysis and problem decomposition. Code prompts are scored by running a hidden pytest suite inside a Docker sandbox. Text prompts are scored by a combination of regex/contains verifiers and an LLM-as-judge step (Claude Opus 4.7).
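As a rough illustration of how a text prompt could be scored, the sketch below combines regex/contains verifiers with a judge score. The benchmark does not document how the two signals are combined, so the equal weighting, the 0 to 1 score range, and every name here (Verifier, contains, matches, verifier_score, text_score) are assumptions made for the example, not the actual implementation. The judge score is passed in as a plain number; no LLM call is made.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verifier:
    """A named pass/fail check applied to a model response (hypothetical type)."""
    name: str
    check: Callable[[str], bool]


def contains(needle: str) -> Verifier:
    # Case-insensitive substring check.
    return Verifier(f"contains:{needle}", lambda text: needle.lower() in text.lower())


def matches(pattern: str) -> Verifier:
    # Case-insensitive, multi-line regex search anywhere in the response.
    compiled = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
    return Verifier(f"regex:{pattern}", lambda text: compiled.search(text) is not None)


def verifier_score(response: str, verifiers: List[Verifier]) -> float:
    """Fraction of verifiers that pass; 0.0 when no verifiers are defined."""
    if not verifiers:
        return 0.0
    passed = sum(1 for v in verifiers if v.check(response))
    return passed / len(verifiers)


def text_score(
    response: str,
    verifiers: List[Verifier],
    judge_score: float,
    judge_weight: float = 0.5,  # assumed weighting, not taken from the benchmark
) -> float:
    """Weighted blend of the verifier pass rate and a judge score in [0, 1]."""
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    if not 0.0 <= judge_weight <= 1.0:
        raise ValueError("judge_weight must be in [0, 1]")
    rule_part = verifier_score(response, verifiers)
    return (1.0 - judge_weight) * rule_part + judge_weight * judge_score


if __name__ == "__main__":
    response = "## Decision\nWe choose PostgreSQL; the main trade-off is operational cost."
    checks = [contains("trade-off"), matches(r"^## Decision"), contains("rollback")]
    print(round(text_score(response, checks, judge_score=0.8), 3))
```

Keeping the rule-based part separate from the judge part makes it straightforward to report both numbers side by side, which helps when judging how much of a text score depends on the LLM judge.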
⚠ Known judge bias. The LLM judge (Claude Opus) may favour responses that match its own style: more verbose, more structured, or following Anthropic's coding conventions. Text scores are indicative only; pytest sandbox results are the most reliable signal. Treat rankings as relative guidance, not ground truth.