MCYSEKA-Maritime Cyber Security Knowledge Archive

Follwing up on my recent post [how NOT to train an offensive ai model], I continued doing this experiment to see what more there is to learn about this process.

Tl;dr:

Using data derived from real solutions for interactive CTF labs as training data for LLMs produce surprisingly different results depending on the training data. As this is an interactive process, fully logged and transparent, one can learn a lot about the different failure modes that arise from different forms of the training data. More, elaborated below.

After building what I believe is the best training data I could for this task, as derived from my own benchmark, and running an evaluation of the SFT model (Gemma*, distinct from Gemma base), it appears to be more reliable and successful in solving most single-vuln labs (maxing out some of them, which impacted precise measurement), solved more chain-vuln labs, in fewer steps, and being more deterministic in its solutions.

The method of evaluation here is a standard split/val/train of all the labs I currently have.

Multiple attempts have been made to validate this behavior outside of my own benchmark, in an attempt to replicate this in 3rd party environment as well.

I could not do so reliably and at-scale – so take these results with a grain of salt.

—

There are multiple ways to improve a model in an interactive learning environment. The leading methods are:

Using a teacher – a larger model whom the smaller one will imitate.
Self-play – the model solves the tasks, and learns from its own solutions
Imitation of human solutions.

I chose neither.

My goal was to build a framework that will, for any given model M, produce a model M*, which is better at web exploitation.

Neither of the methods above provide that solution.

My approach was to use the actual solutions I have for the labs. The advantage for this approach is that one is adding more information to the system that is directly derived from a truth source about the environment it’s attempting to solve. The disadvantage is, that truth is often not behaviorally aligned with how a human or AI interacts with the app.

The solution for this problem, in short, is to take that source of truth and transform it into something that more closely resemble how an actual exploitation looks.

Finding this solution required iterating over how exactly I think this transformation should look. This iteration showed interesting behavior along the way.

Essentially, given the right training data, one could tune a knob and make the model more recon-heavy, payload-focused, or, of course, generically worse than the base model.

I’ve divided this behavior internally into a few buckets, which helped me during this process.

After I settled on what I think is the most balanced and representative dataset of live, interactive, web exploitation – I kicked off doing supervised fine-tuning for the model.

I then evaluated the new model, Gemma* against Gemma base, on many thousands of runs through the val and test splits.

The results are largely positive. On the sub-set of the labs which actually measure generalization, and not memorization, Gemma* consistently beats Gemma. So much so, that my evaluation data is skewed because for labs that Gemma has scored ~80% on, Gemma* consistently got 100%. This skews the results because the improvement could be more than +20pp, but I could not see it under this circumstance.

They’re also positive compared to scale – 64 training labs total. Generally, in attempts to fine-tune AI models of this type, the number I used is 2-3 orders of magnitude smaller than normally accepted.

Which raises my next point about data scarcity.

There is no public, open-source, audit of full-trace to solve CTFs. Unlike coding and other agentic tasks, where there’s a lot of data out there, this format of data is scarce. Specifically, what is scarce is a known, correct, deterministic solution trace for a given CTF.

On principle, I could have automatically built thousands of additional labs – it would have taken me a day – but that wasn’t quite what I was looking to do.

Bottom line:

It appears that, thanks to this data I’ve collected, I was able to get a net positive result on this training run. If I do decide to push up the scale, and perhaps invest more money and train a model larger than Gemma, I could possibly detect some additional improvements that were out-of-scope of the scale of this experiment.

More specifically, this access to correct and grounded results of CTFs proved valuable in this training, in a way that I think simple write-ups for known exploits would not have been.

I used the TarantuBench benchmark in this research, and all interactive labs are available on tarantulabs.com

submitted by /u/dvnci1452
[link] [comments]

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Global Cyber Security Educational Info Links – real-time news aggregation

US Says It Is Mobilizing Assistance for Venezuela After Earthquakes

Iran’s neighbor, U.S. ally: What Pakistan gains from being a peacemaker

Supervised Reinforcement Learning for LLMs on CTF Labs

Iran declares new Hormuz route ‘unacceptable and dangerous,’ warns against ships transiting without approval

South Africa stun South Korea to reach World Cup knockouts for the first time

Iran warns ships it’s ‘unacceptable and dangerous’ to transit the Strait of Hormuz without their approval

South Africa beat South Korea to reach World Cup knockout stages for first time

“Don’t Even Know How Long It Lasted”: Venezuela Residents Recount Earthquake

Investors still seek a human touch even with AI tools at hand: HSBC

Venezuela earthquake latest: Thousands feared dead as casualties grow after Caracas hit by back-to-back 7.5 magnitude shocks