AI Models Fall Flat: $2 Million Prize Remains Unclaimed as Top Models Score Below 1%
The ARC-AGI-3 benchmark has put the world's top AI models to the test, and the results are underwhelming, with even the best models scoring below 1%. A $2 million prize remains unclaimed, highlighting the significant gap between human and artificial intelligence.
The latest benchmark from the ARC Prize Foundation has delivered a sobering reality check for the AI community. Pitted against interactive game environments that humans can solve with no prior knowledge or instructions, the top models failed to impress: Gemini 3.1 Pro Preview, GPT 5.4, Opus 4.6, and Grok-4.20 all scored dismally, with the highest score a mere 0.37%.
The ARC-AGI-3 benchmark uses a metric called Relative Human Action Efficiency, which compares the number of actions a model takes to solve an environment against the number a human needs. Because wasteful exploration drags the score down, the approach penalizes brute-force strategies and rewards genuine problem-solving. With a $2 million prize on the line, the stakes are high, and the failure of today's top models is a significant setback that underscores the need for further research into general intelligence and problem-solving. For users and developers, the takeaway is that current models remain far from human-level performance on these tasks, and significant advances are needed before they can be relied upon for complex, novel problems.
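The scoring idea can be illustrated with a short sketch. The exact formula is not published in the article, so the function name and the simple action-count ratio below are assumptions for illustration only: a model that needs far more actions than a human earns a score far below 100%.

```python
def relative_human_action_efficiency(human_actions: int, model_actions: int) -> float:
    """Hypothetical sketch of an action-efficiency score.

    Returns the human action count divided by the model's, so a model
    that needs many more actions than a human scores close to 0.0,
    while matching human efficiency scores 1.0 (i.e. 100%).
    """
    if human_actions <= 0 or model_actions <= 0:
        raise ValueError("action counts must be positive")
    # Cap at 1.0 so a model cannot score above 100% of human efficiency.
    return min(1.0, human_actions / model_actions)

# Example: a human solves a level in 40 actions, a model needs 16,000.
score = relative_human_action_efficiency(40, 16_000)
print(f"{score:.2%}")  # 0.25% -- in the sub-1% range reported above
```

Under this kind of metric, even a model that eventually solves an environment scores poorly if it gets there by exhaustive trial and error, which is why brute-force strategies are penalized.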