This started as a response to the recent "you have to post-train a model to pen-test" Show HN — we don't think you need to, just makes life a bit easier.
Across 10K+ of our agent transcripts from benchmarking against OpenAI's EVMBench, we saw zero refusals. In the closed-frontier models, the refusal you hit is mostly a separate content classifier, or a system prompt, not so much the model itself. Breadth (more cheap agents) beats a bigger model, but it puts more requirements on context engineering
This started as a response to the recent "you have to post-train a model to pen-test" Show HN — we don't think you need to, just makes life a bit easier.
Across 10K+ of our agent transcripts from benchmarking against OpenAI's EVMBench, we saw zero refusals. In the closed-frontier models, the refusal you hit is mostly a separate content classifier, or a system prompt, not so much the model itself. Breadth (more cheap agents) beats a bigger model, but it puts more requirements on context engineering
https://news.ycombinator.com/item?id=48609231