Wink Pings

Since Its June Release, Gemini 2.5 Pro Still Ranks #1 on SimpleBench

Despite the release of newer models such as Grok 4, GPT-5, and Sonnet 4.5, Gemini 2.5 Pro continues to lead the leaderboard on simple-bench.com. Some attribute its lead to strong spatial and visual understanding, while others question whether it is the result of targeted optimization.

![Screenshot of Simple Bench Rankings](https://simple-bench.com/index.html)

Four months later, Gemini 2.5 Pro still holds the top spot on simple-bench.com's benchmark tests.

Since its release in June, the model has faced challengers such as Grok 4, GPT-5, and Sonnet 4.5, yet none has overtaken it. In the AI world, four months spans several release cycles, which makes such sustained performance rare.

Some users have pointed out that Gemini is particularly strong at spatial and visual understanding. That may explain its lead, since simple-bench primarily tests basic common-sense reasoning.

However, the result is contested. Some believe Google specifically optimized for this type of test, similar to what was done with LMArena previously. Notably, the 03-25 preview version scored around 50%, while the final version scored higher, raising suspicions that test-set data was used for training.

On the other hand, there's a counterargument: if it were truly targeted training, the model should perform poorly on new benchmarks, but Gemini actually performs well. A more plausible explanation is that the training data happens to align closely with this particular benchmark.
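The counterargument above can be made concrete with a rough check: compare a model's standing on the suspected benchmark with its standing on benchmarks it could not have been tuned for. The sketch below is only an illustration under stated assumptions. The function name `contamination_gap` is hypothetical, every score is made up, and all scores are assumed to be normalised relative to the best model on each benchmark. None of the numbers come from simple-bench.com.

```python
from statistics import mean


def contamination_gap(suspect_score: float, fresh_scores: list[float]) -> float:
    """Return the suspect-benchmark score minus the mean fresh-benchmark score.

    Scores are assumed to share one scale (here: fraction of the best
    model's score on each benchmark, so 1.0 means "top of the board").
    A large positive gap hints at benchmark-specific tuning; a small gap
    is what a broadly strong model would show.
    """
    if not fresh_scores:
        raise ValueError("need at least one fresh benchmark score")
    return suspect_score - mean(fresh_scores)


# Hypothetical, made-up relative scores for two imaginary models.
broadly_strong = contamination_gap(1.00, [0.96, 0.98, 0.94])
likely_overfit = contamination_gap(1.00, [0.70, 0.65, 0.72])

print(f"broadly strong model gap: {broadly_strong:.2f}")
print(f"likely overfit model gap: {likely_overfit:.2f}")
```

A small gap does not prove the absence of contamination, and a large one does not prove its presence; it only shows which pattern the counterargument relies on.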

Interestingly, the 06-05 version actually performed worse than the 03-25 version on certain benchmarks, indicating that the optimization process isn't a one-way improvement but involves trade-offs.

Currently, the Gemini 3.0 team also seems to be closely monitoring this benchmark, possibly even delaying their release plan because of it. In the AI race, the weight given to benchmarks can significantly influence the entire development pace.

Published: 2025-10-21 12:34