AI in Cybersecurity: What We Presented at Cybersec Europe

On 20 May 2026, Ingram gave the opening research talk at "AI in Cybersec", a private side-event of the AI in Defence Summit series, held in the Cybersec Europe VIP Lounge in Brussels. This is a short recap of what we presented and why it matters.
Download the slides (PDF)
Where we were
The evening was part of the AI in Defence Summit series, organised by Seven Events alongside Agoria, with Cybersec Europe as host partner. We were on first, with a research talk on what AI has done to the offensive and defensive security balance over the past few months. The short version of our argument: the transition is happening now, faster than most security teams have planned for, and the numbers are no longer speculative.
The starting point: who writes the code now
We opened with a single recent example. In mid-May, Anthropic merged the Rust rewrite of Bun, the JavaScript runtime that powers Claude Code and the Agent SDK. The rewrite is roughly a million lines of code, written by AI agents in six days, with the test suite passing at 99.8%. It also contains more than 13,000 unsafe blocks, in a language whose entire purpose is to avoid them.
That is the loop worth sitting with: a runtime largely written by AI now runs the AI agents that write more code. It updates the old "trusting trust" problem that Ken Thompson posed in 1984. His question was how deep you have to audit the code you run. The new question is who, or what, wrote it in the first place, and whether you could tell. Audit as the industry has practised it for two decades no longer reaches all the way down. The code is not necessarily worse. There is simply far more of it, written far faster.
We were candid that this applies to us too. Most of the code we ship is AI-written and most of our review is AI-assisted, and we suspect that is quietly true of most teams. That is exactly why we have been building governance and observability for AI agents, so the audit trail records what an agent did to produce a result, not just the result. We wrote about that work in Announcing Ingram Cloud.
What AI actually finds in real code
The core of the talk was the evidence, drawn from recent published research rather than marketing.
Project Glasswing and Claude Mythos. Anthropic's Mythos model, evaluated under Project Glasswing, found a 27-year-old bug in OpenBSD's SACK implementation and a 16-year-old vulnerability in FFmpeg's H.264 codec (one of the most-fuzzed codebases in existence), and it produced a fully autonomous remote-code-execution exploit against FreeBSD's NFS server. On Anthropic's internal OSS-Fuzz benchmark, the previous-generation model managed roughly one control-flow hijack; Mythos managed ten. Cost per zero-day discovery landed somewhere between $50 and $2,000.
This is not one lab. The UK AI Security Institute put OpenAI's GPT-5.5 at parity with Mythos on expert cyber tasks (71.4% against 68.6%), and they are the only two models to have completed AISI's full 32-step corporate-network attack simulation end to end.
ExploitGym. The clearest single datapoint is an independent benchmark published on 11 May 2026 by a 17-author team spanning UC Berkeley, the Max Planck Institute for Security & Privacy, and UC Santa Barbara, together with researchers from Anthropic, OpenAI, and Google. It comprises 898 real-world exploitation tasks, scored on working exploits within a two-hour budget per task.

Two results stood out to the room. First, ASLR, the V8 heap sandbox, and KASLR are the mitigations that have anchored twenty years of memory-safety thinking, and even with all three enabled Mythos still produced 45 working exploits and GPT-5.5 produced 21. The paper's own conclusion, co-authored by all three labs, is that "current mitigations alone are likely insufficient to neutralize AI-driven exploitation."
Second, the agents routinely find bugs nobody pointed them at. Mythos captured the flag on 226 tasks but only 157 were the intended bug; on the other 69 it found and exploited a different vulnerability in the same target. These agents are auditing and fuzzing on their own, not pattern-matching to known exploits.
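To put those counts in proportion, here is a quick back-of-the-envelope calculation. It uses only the figures quoted above and assumes the 226 captures are counted against the full 898-task set; the variable and function names are ours, not the paper's.

```python
# Figures quoted in this post for Mythos on ExploitGym.
TOTAL_TASKS = 898      # size of the benchmark
FLAGS_CAPTURED = 226   # tasks where Mythos captured the flag
INTENDED_BUG = 157     # captures that used the intended vulnerability


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Captures that exploited a different bug from the one the task was built around.
unintended = FLAGS_CAPTURED - INTENDED_BUG

# Assumption: captures are measured against all 898 tasks.
solve_rate = share(FLAGS_CAPTURED, TOTAL_TASKS)
unintended_share = share(unintended, FLAGS_CAPTURED)

print(f"Unintended-bug captures: {unintended}")
print(f"Share of all tasks solved: {solve_rate}%")
print(f"Share of captures via an unintended bug: {unintended_share}%")
```

On these numbers, roughly three in ten successful captures came from a vulnerability the task authors had not planted as the target.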
We have seen the same shape in our own engagements, both on our own application and on a live fintech app: the agent handles the volume, a human carries the judgment, and the cost of a real assessment drops sharply.
It is already in the wild
On 11 May, Google's Threat Intelligence Group disclosed what it assesses as the first AI-built zero-day operation, in which attackers used OpenClaw to find and weaponise an unknown vulnerability, with strong interest from groups linked to China and North Korea. So this is no longer a question of whether the capability will be used. It was used the week before the event.
The defender economy is already feeling it. HackerOne told researchers this month that triage times have slipped and submission volume is surging, and it is rewriting its Code of Conduct for AI-assisted reports. The N-day gap between disclosure and a weaponised exploit, long the defender's grace period, has shrunk to under a day, at a cost of under $2,000.
What we told the room to do this week
- Run frontier models over your own code now. You do not need Mythos; broadly available models will find serious bugs.
- Shorten your patch-enforcement window. Treat CVE-fixing dependency bumps as urgent.
- Refresh your disclosure policy for the inbound volume you are about to see.
- Automate incident-response triage. Volume will rise faster than headcount.
- Build the AI-in-security muscle now, not during the incident.
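As a rough illustration of the second item, shortening the patch-enforcement window, the sketch below flags pending dependency bumps that fix a CVE and have sat unmerged for longer than a chosen window. Everything in it is hypothetical: the `PendingBump` record, the package names, and the one-day default are assumptions for the example, not a description of any particular tool or of our own pipeline.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PendingBump:
    """Hypothetical record of an unmerged dependency update."""
    package: str
    fixes_cve: bool        # does the new version fix a published CVE?
    fix_published: date    # when the fixed version became available


def overdue_security_bumps(bumps, today, window=timedelta(days=1)):
    """Return CVE-fixing bumps older than `window`, oldest first."""
    overdue = [
        b for b in bumps
        if b.fixes_cve and today - b.fix_published > window
    ]
    return sorted(overdue, key=lambda b: b.fix_published)


# Made-up example data.
pending = [
    PendingBump("libfoo", True, date(2026, 5, 15)),
    PendingBump("libbar", False, date(2026, 5, 1)),   # feature bump, not urgent
    PendingBump("libbaz", True, date(2026, 5, 20)),   # fix released today
]

overdue = overdue_security_bumps(pending, today=date(2026, 5, 20))
for bump in overdue:
    print(f"OVERDUE: {bump.package} (fix available since {bump.fix_published})")
```

A check like this could run in CI and fail the build when the list is non-empty. The one-day default mirrors the sub-day N-day gap mentioned earlier, but the right window is a policy decision for each team.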
Most security tooling has eventually favoured defenders. Fuzzing did, static analysis did, and there is good reason to expect AI will too. But the damage gets done during the transition, and the transition started this year.
Thanks
Thank you to Sona Stepanjan and Seven Events for organising the evening alongside Agoria and Eric Van Cangh, and to Cybersec Europe for being a great partner throughout. Thanks also to our mystery guest speaker, Marijn Markus, whose talk genuinely rocked the room.
The full deck is linked above.
If this is a conversation you are already having internally, whether about AI-driven security testing or about governance and observability for the agents your teams are running, we would be glad to compare notes.
Get in touch
