I benchmarked GPT-5.5, and it solved a lab that no prior model has ever solved

I tested GPT-5.5 against Claude Sonnet 4.6 and Gemini 3 Flash. I chose the base models to avoid biasing the comparison toward any provider.

I ran them against 8 cybersecurity challenges, ranging from beginner to advanced. Each model had 3 attempts per lab, with a max of 30 steps per lab.
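The protocol above (3 attempts per lab, a 30-step cap) can be sketched as a simple harness loop. This is a minimal illustration, not the author's actual harness; the names (`ToyLab`, `evaluate`) and the toy "lab" logic are hypothetical stand-ins.

```python
# Hypothetical sketch of the evaluation protocol described above:
# each model gets MAX_ATTEMPTS tries per lab, each capped at MAX_STEPS steps.
MAX_ATTEMPTS = 3
MAX_STEPS = 30

class ToyLab:
    """Stand-in lab: 'solved' once the agent has taken `difficulty` steps."""
    def __init__(self, name, difficulty):
        self.name = name
        self.difficulty = difficulty

    def reset(self):
        return 0  # state = number of steps taken so far

    def step(self, state):
        return state + 1  # a real harness would execute the model's action here

    def is_solved(self, state):
        return state >= self.difficulty

def evaluate(labs):
    results = {}
    for lab in labs:
        solved = False
        for _attempt in range(MAX_ATTEMPTS):
            state = lab.reset()
            for _step in range(MAX_STEPS):
                if lab.is_solved(state):
                    break
                state = lab.step(state)
            if lab.is_solved(state):
                solved = True
                break
        results[lab.name] = solved
    return results

# A lab whose obvious path needs hundreds of steps is unsolvable
# within the 30-step budget, no matter how many attempts are allowed.
results = evaluate([ToyLab("beginner", 5), ToyLab("advanced", 300)])
```

The point of the budget cap is visible even in this toy version: the 300-step lab can never be brute-forced within 30 steps, so a model has to find a shorter path.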

All the models solved exactly the same set of labs, but by tracking their behavior throughout each task, I gleaned several interesting insights.

The standout result, however, was that GPT-5.5 was the first model I have tested to solve a particular advanced lab. I use this specific lab as a real test of intelligence: the obvious path to a solution is relatively straightforward, but it requires hundreds of steps.

The real solution, given the step budget, is to ignore the lab description and find a faster, more efficient path.

GPT-5.5 was the first model to ever solve it.

Full write-up here:

https://tarantulabs.com/research/frontier-three-head-to-head-2026-04

If you’d like to benchmark and evaluate the models yourself, the full benchmark is on HuggingFace and GitHub.

submitted by /u/dvnci1452