Anthropic released Claude Opus 4.6 with a 1-million-token context window (beta), extended thinking, and a METR-estimated 50% task-completion time horizon of 14 hours 30 minutes — the longest of any model.
It scores 80.8% on SWE-bench Verified, leading Gemini 3.1 Pro's 80.6% by 0.2 percentage points. Combined with the 1-million-token context window and the long time horizon, the result positions the model for long-context coding and agentic tasks.
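
To make the 50% time-horizon figure concrete, here is a minimal sketch of how such a number can be read. It assumes success probability follows a logistic curve in the logarithm of task length, so the horizon is the task length at which estimated success is exactly 50%. The function name `success_probability` and the `SLOPE` value are hypothetical and chosen only for illustration. They are not METR's fitted parameters, and METR's actual procedure may differ in detail.

```python
import math


def success_probability(task_minutes: float, horizon_minutes: float, slope: float) -> float:
    """Illustrative logistic curve in log2(task length).

    Returns exactly 0.5 when task_minutes equals horizon_minutes,
    more than 0.5 for shorter tasks and less than 0.5 for longer ones.
    """
    x = math.log2(horizon_minutes) - math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-slope * x))


# 14 hours 30 minutes, the horizon quoted in the text, expressed in minutes.
HORIZON_MINUTES = 14 * 60 + 30

# Hypothetical steepness of the curve; not a value reported by METR.
SLOPE = 0.6

if __name__ == "__main__":
    for minutes in (60, 240, HORIZON_MINUTES, 2880):
        p = success_probability(minutes, HORIZON_MINUTES, SLOPE)
        print(f"{minutes:>5} min task -> estimated success {p:.2f}")
```

Under this assumed model, a task that takes a human about 14.5 hours sits at the 50% point, shorter tasks sit above it, and longer tasks sit below it. How quickly the probability falls off depends entirely on the assumed slope.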
The extended thinking capability allows Opus 4.6 to reason through complex problems step by step before generating a final answer, improving accuracy on multi-step tasks.
See the full leaderboard for updated rankings across all categories.