MCYSEKA-Maritime Cyber Security Knowledge Archive

Last week I spent more time and money than I’m willing to admit trying to make a small AI model very good at CTFs.

Specifically, training it based on the benchmark I created – TarantuBench. That benchmark measures the offensive capabilities of artificial intelligence models using interactive cyber puzzles. Each such puzzle has a unique solution, so you can gauge whether the model succeeded or not through a direct check.

My thesis is the following – if the benchmark measures cyber capabilities, then perhaps it is possible to train a model based on it to perform such puzzles better.

The answer?

Maybe

Of course, I started the hard way. I set up a server in Google’s cloud where the model would try to solve these puzzles over time, and learn from its mistakes and successes. GRPO, for those wondering.

It didn’t work for an engineering reason – I wasn’t convinced that my implementation of this algorithm for the benchmark I built was correct.

I switched to a simpler method. I let the model run on the entire benchmark, took all its solutions, and tried to train it to continue solving in that way and not in another way that leads to errors. SFT of course.

Two problems:

First of all, the data I built wasn’t good. It took me (too) long to figure it out. I took the solutions as they were, without thinking too much about how I would re-feed them to the model so that it would really understand something from this data.

Then, I realized that I didn’t have enough data. I didn’t run the model enough times on the benchmark. At this point, between payments to Google’s cloud, for the model, and for Cursor, I decided that I would end my investment in the experiment.

The result is that every time I trained the model, it failed to exceed its original performance, and sometimes even deteriorated.

What did I learn?

Don’t train on solvers alone. Oracle scripts ≠ agent policy.

Don’t count solves without counting labs. 450 solves on 2 labs is not abundance.

Don’t distill a strong teacher into a weak student without student rollouts. Cross-model SFT is few-shot transfer.

Don’t expect fork rows to replace episodes. Prefix→decision pairs don’t teach horizon control.

Don’t augment your way out of n≈10. Grounding filters and replay repair are hygiene, not data.

Don’t split by run when labs repeat. Lab-disjoint or don’t report generalization.

Don’t chase chains before val singles lift. Composition needs components.

Don’t trust train loss. Track val solve rate and per-lab regressions against base.

Don’t skip the base arm. Every SFT eval should log base=SOLVED|FAIL per lab.

What does this mean?

That the experiment was unsuccessful – not that my thesis is wrong. I don’t plan to end this saga here, but I will take a short break and am sharing with you what *not* to do when you approach training models.

Stay tuned, I’ll try again soon.

Full experiment at tarantulabs.com

submitted by /u/dvnci1452
[link] [comments]

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Global Cyber Security Educational Info Links – real-time news aggregation

World Cup debutants Curaçao clinch historic first point after holding Ecuador to draw

Israel seized Hezbollah underground command center in southern Lebanon

‘Humans, we have arrived!’ Brazilians receive alien invasion alerts

How NOT to Train an Offensive Security AI Agent

Iran says it is closing Strait of Hormuz, testing fragile agreement with U.S. – The Washington Post

Supreme Court of Nepal rules in favour of marriage equality

6/20: CBS Weekend News

Ukraine war live: Zelensky warns of ‘massive attack’ from Moscow – The Independent

Backstage at Gorillaz’ epic, one-off stadium show: ‘The vibe is ridiculous’

Texas Supreme Court denies request to save beach from SpaceX