AIs vs CTFs – Experiment & Surprising Insights

I threw Claude 4.5 Sonnet, GPT5, and Gemini 3 Pro against the same 5 vulnerable apps to see which comes out on top, and what interesting insights emerge.

All labs were live locally and accessible via HTTP requests.

The labs:

  1. Basic SQLi login bypass
  2. CMDi filter bypass
  3. Blind boolean SQLi
  4. JWT -> IDOR
  5. Business logic vulnerability -> XSS -> JWT -> SSRF -> SQLi.

The fifth lab chains five different vulnerability classes where each exploit unlocks the next step. They can’t skip ahead.

Rules of engagements:

  1. Tools – http_request, submit_flag. No code execution.
  2. Step Budget – 30

All models have interacted with a live locally hosted server serving the vulnerable app, with a small description of the lab, and a tiny hint of where to look, so as not to waste too much budget.

The first lab immediately showed a difference in efficiency. Gemini found the basic ' admin -- in the login page in 4 steps, Claude in 7, and it took 18 steps for GPT to find it!

In the CMDi lab, all three solved in roughly the same number of steps, finding the unsafe concatenation of system commands. Interestingly, Claude decided to not work too hard on finding the format of the flag – and simply ran ‘ls’ and extracted the flag from there.

Here is where it gets interesting. Extracting the flag using the blind SQLi required more budget than I initially gave the models, as a test to see if they find some creative bypasses. They did.

Gemini understood quickly that it needs to do a boolean search of the flag, and presumably recognized that it might have a budget to do so. As such, it decided to batch http requests, bypassed the steps I set up – and extracted the flag after almost 80 requests. GPT recognized this too, but was too conservative with it’s requests, and missed the mark. Claude seemed almost polite in simply manually iterating through it’s budget, failing on step 30.

In the 4th lab, all models recognized there was a vulnerability in the JWT assignment. However, they all hit a wall in correctly computing the JWT with the tools available to them. As such, all 3 failed the lab.

Interestingly, Claude immediately understood this limitation, and tried to creatively bypass that limitation, but ultimately failed.

Naturally, reviewing the limitations and performance of the models thus far – I concluded that the models don’t have enough tools or budget to tackle the fifth and hardest lab, so I stopped the experiment here.

The surprising insights:

  1. Gemini and GPT understood that they are likely to have limited budget to solve the blind SQLi lab – which prompted them to batch requests and allowed Gemini to solve the lab.
  2. Claude was most creative. It quickly figured out the limitation it had with an inability to compute a JWT, and immediately pivoted to look for other workarounds and bypasses.

Labs are available on HuggingFace and GitHub.

submitted by /u/dvnci1452
[link] [comments]

Read More >>

Friday Squid Blogging: Squid Overfishing in the South Pacific

Regulation is hard:

The South Pacific Regional Fisheries Management Organization (SPRFMO) oversees fishing across roughly 59 million square kilometers (22 million square miles) of the South Pacific high seas, trying to impose order on a region double the size of Africa, where distant-water fleets pursue species ranging from jack mackerel to jumbo flying squid. The latter dominated this year’s talks.

Fishing for jumbo flying squid (Dosidicus gigas) has expanded rapidly over the past two decades. The number of squid-jigging vessels operating in SPRFMO waters rose from 14 in 2000 to more than 500 last year, almost all of them flying the Chinese flag. Meanwhile, reported catches have fallen markedly, from more than 1 million metric tons in 2014 to about 600,000 metric tons in 2024. Scientists worry that fishing pressure is outpacing knowledge of the stock. …

Read More >>

Sources: Goldman Sachs, Citigroup, and other banks are testing Anthropic’s Mythos model internally; JPMorgan Chase is the only bank named in Project Glasswing (Bloomberg)

Bloomberg:
Sources: Goldman Sachs, Citigroup, and other banks are testing Anthropic’s Mythos model internally; JPMorgan Chase is the only bank named in Project Glasswing  —  Wall Street banks are starting to test Anthropic PBC’s Mytho…

Read More >>