Case Study

AI Pentesting in the Wild: What We Found on a Live Fintech App

Alexandre De Groodt
2026-04-21

A week ago, an AI agent read the public JavaScript bundle of a production fintech app, noticed a suspicious cookie, flipped one UUID inside it, and logged into someone else's account. No credentials, no exploit out of a CVE database. Just a cookie the server trusted because nobody had told it not to.

That's the short version of what we want to talk about. The long version is more interesting, because the same week surfaced two more issues that would each have been worth the engagement on their own, and because the workflow we used to find them is genuinely different from what a human-only team would have done six months ago.

The setup

We wanted an honest answer to a specific question. AI agents are already finding zero-days in open source projects. Are they good enough yet to make real pentesting accessible to small and mid-sized companies, the kind that currently skip it because a traditional engagement runs €20k and three weeks? How close are we to useful automation, and where is the human still the bottleneck?

The target was financica.app, a brand-new multi-tenant financial management SaaS preparing to handle real invoices, real expense data, and real bank integrations for real organizations. The owner consented to the assessment and to this writeup. We ran two well-known AI pentest agents, Shannon and Strix, against it in a supervised three-phase engagement.

Total elapsed time from kickoff to final report: four working days. For context, a comparable manual engagement on a target this size typically runs two to three weeks.

The workflow

Left unsupervised, AI agents are noisy. They spin up test accounts, trip rate limiters, set off error alerts, and sometimes write state to production in ways that make reproducibility impossible. The single biggest lesson from this engagement is that the workflow around the agent matters more than which agent you pick.

The loop we settled on:

Recon (AI, passive) → Plan (Human) → Execute (AI, supervised) → Validate (Human) → Report (Human + AI)

Recon runs passively and produces output the human uses to prioritize. Planning is where business context enters: a payments endpoint matters more than an avatar upload, and only a human who knows the product can say that. Execution is where the AI earns its keep, because the tedious parameter-fuzzing and identifier-swapping that takes a human an afternoon takes an agent minutes. Validation is non-negotiable. The agent's raw output is a list of hypotheses, not a list of vulnerabilities. Reporting is mostly editorial.

One line worth underlining. The tool only performs as well as the operator. Pointing an AI at a target with no methodology gets you a shelf full of test accounts and a folder full of false positives. The workflow is the product.

Recon

Before making a single adversarial request, the AI crawled the app, extracted every API route reachable from the JavaScript bundle, and fingerprinted the stack from response headers and client config. Inside eleven minutes it had identified a Next.js frontend backed by Supabase, enumerated 83 distinct API endpoints, and catalogued the third-party integrations (Stripe, Resend, one analytics vendor) along with the public Supabase anon key embedded in the client.

A competent human would have found the same things in maybe two hours. The value of the agent here is completeness. Every endpoint ended up in the map, not just the ones that looked interesting at first glance, which meant the testing plan could be built from a full surface rather than a sampled one.

We got explicit permission from the owner to publish this walkthrough. Nothing here runs without consent.

Finding one: the invitations table

The first real hit came from a single unauthenticated request using nothing but the public anon key visible in the client bundle. The response contained every pending organization invitation on the platform: invitee emails, issuing organization, and the plaintext invitation token that would let the holder accept on the invitee's behalf.
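
You do not need an external client to check what the anon key exposes. A minimal sketch, assuming a default Supabase project where the admin connection is allowed to switch to the anon role, and a hypothetical table name:

    -- Run from the SQL editor: see the table the way an unauthenticated client sees it.
    -- 'invitations' is a hypothetical table name; adjust to your schema.
    begin;
    set local role anon;
    select count(*) from public.invitations;  -- anything above zero is readable with just the anon key
    rollback;

Whatever that query returns is what a stranger holding nothing but your published anon key can read.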

The underlying issue is the Supabase failure mode that trips up most teams at least once. The anon key is designed to be public, and every security guarantee in a Supabase app hangs off the Row-Level Security policies on the underlying tables. The invitations table had been created during early development and never got an effective SELECT policy scoping reads to the right users, so the auto-generated REST API served the entire table to anyone holding the anon key. From inside the app nothing looked wrong, because the app's own queries kept working; the gap only shows up the day someone queries the table from outside with nothing but the key.

What a safe version looks like: a policy that restricts SELECT on invitations to either the issuing organization's members or a user whose email matches the invitee address. Two lines of SQL. The fix shipped within a few hours of the finding.
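
Spelled out, the fix is a little more than two lines, but not much. A sketch under assumed names (organization_members, invitee_email, and the column names are hypothetical; adjust to the real schema):

    -- Members of the issuing organization can read its invitations.
    create policy "org members read invitations"
      on public.invitations for select
      using (
        organization_id in (
          select organization_id
          from public.organization_members
          where user_id = auth.uid()
        )
      );

    -- The invitee can read an invitation addressed to their own email.
    create policy "invitee reads own invitation"
      on public.invitations for select
      using (invitee_email = auth.jwt() ->> 'email');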

Finding two: the storage bucket chain

The second finding on its own would have been medium severity. Chained with the first, it became critical.

The app stored uploaded receipts and invoice PDFs in a Supabase storage bucket. The bucket was configured as public, on the assumption that object names were unguessable UUIDs. That assumption holds until someone can enumerate the UUIDs, and the invitations leak from finding one gave the agent a starting set of organization IDs, which surfaced in file paths elsewhere in the API. A few minutes of correlation and the agent had an unauthenticated path from "I know nothing about this app" to "I am downloading another organization's financial documents."

The lesson worth taking away: storage buckets want RLS too. storage.objects is a table like any other, and the correct pattern is a policy that checks the requesting user's organization membership against the path prefix of the object being read. Public buckets are appropriate for genuinely public assets (marketing images, shared avatars), and inappropriate for anything whose secrecy depends on URL obscurity.
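
A sketch of that pattern, assuming a bucket called documents, object paths that start with the owning organization's id, and the same hypothetical organization_members table as above:

    -- Members of an organization can read objects stored under that organization's path prefix.
    create policy "org members read their documents"
      on storage.objects for select
      using (
        bucket_id = 'documents'
        and (storage.foldername(name))[1] in (
          select organization_id::text
          from public.organization_members
          where user_id = auth.uid()
        )
      );

The bucket itself also has to be switched to private; objects in a public bucket are served from their public URL without the policy ever being consulted.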

Finding three: the impersonation cookie

This was the one that changed the room when we validated it.

While crawling the authenticated application, the agent noticed a cookie called impersonate_user being set on admin-facing pages. The value was a bare UUID. No signature, no HMAC, no session binding, no server-side check on whether the requesting account held an admin role. The feature had presumably been built for the support team to reproduce customer bug reports, and at some point the authorization check either never got written or got removed during a refactor.

The agent's hypothesis was straightforward. If the server trusts the cookie without checking who set it, any authenticated user can set it to any user's UUID and the server will serve that user's pages. The agent set the cookie to the UUID of a second test account we controlled, sent the request, and got back a dashboard belonging to a different tenant, complete with balances, bank account last-four digits, and the other user's email rendered into the HTML.

We validated the full scope manually against two internal test accounts before writing it up. Any paying customer could impersonate any other customer. Reproducible, unambiguous, immediate disclosure.

What a correct version of this feature looks like:

  1. The impersonation token is a signed JWT (or equivalent), not a raw identifier, so the client cannot forge it.
  2. The server verifies on every request that the token's subject holds a specific support or admin role, and that the target user has not opted out.
  3. The impersonation state is logged, auditable, and surfaced visibly in the UI so the impersonator can never forget they are not themselves.
  4. The token is short-lived and bound to the original session, not a long-lived cookie that survives logout.

None of this is exotic. All of it is standard practice for support-tool impersonation flows. The lesson is that a feature touching authorization needs to be treated as security-critical even when it was built for internal convenience, because at sufficient scale there is no such thing as internal.
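
Points 1, 2, and 4 live in application code, but point 3 can be anchored in the database. A rough sketch, with hypothetical names, of an append-only audit table that the impersonation feature writes to through the service role:

    -- Hypothetical audit log for impersonation sessions, written only by the server.
    create table public.impersonation_audit (
      id              bigint generated always as identity primary key,
      admin_user_id   uuid not null,
      target_user_id  uuid not null,
      started_at      timestamptz not null default now(),
      ended_at        timestamptz,
      reason          text
    );

    -- Deny-all for client roles: no policies means anon and authenticated users
    -- can neither read nor rewrite the log, while the service role bypasses RLS
    -- and appends entries server-side.
    alter table public.impersonation_audit enable row level security;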

Validation and reporting

Both phases were faster than we expected, and for the same reason: the agent produced drafts, the humans edited. The raw white-box pass flagged 25 items; human review confirmed 10 of them as real, promoted two to critical, and dropped the rest as false positives, duplicates, or theoretical issues with no practical path to exploitation. The black-box phase surfaced 7, scoped deliberately tighter because we knew the white-box pass was coming. After deduping across the phases we had twelve distinct findings.

Severity   Count
Critical   6
High       3
Medium     2
Low        1

For the report itself, the agent generated the skeleton: severity ranking, reproduction steps, remediation notes per issue. The human work was editorial. Which proofs-of-concept needed to be sanitized before publication, how to frame the executive summary without overstating the risk or burying the lede, what the owner needed to prioritize in the first 24 hours versus the first two weeks. Those calls require judgment and context, and the agent is not close to making them.

Cross-checking with the white box

After the external pass we re-ran the same target with source code in hand. The exercise was worth doing for one reason: when a finding surfaces independently from both directions, you can ship it with confidence that you are not chasing an artifact. All six critical findings appeared in both passes.

Where the passes diverged, they diverged in expected ways. The source review found implementation details no black-box test could reach: a debug endpoint behind an environment flag, a couple of injection sinks unreachable from the current UI but one form away from becoming live. The external pass confirmed which source-level smells were actually exploitable and which were theoretical.

One practical caveat. White-box AI agents do not parallelize as well as the marketing suggests. When we ran too many concurrently, agents wrote outputs to paths other agents could not locate, produced duplicate findings under different titles, and occasionally lost their own scratch state. Run them in smaller supervised batches.

What to audit in your own app this week

If you are running a Supabase-backed app, the findings above generalize cleanly. Five things worth checking today:

  1. For every table in your schema, confirm that RLS is enabled and that at least one policy exists per operation you want to allow. "RLS enabled with no policies" is not the same thing as "locked down"; it is a deny-all that quietly returns empty results, which tends to get papered over with an overly broad policy the first time it breaks a feature. The queries after this list will surface the gaps.
  2. For every storage bucket, decide whether it is genuinely public (marketing assets) or access-controlled (anything user-specific), and write storage.objects policies for the latter. Path obscurity is not a security control.
  3. Audit every cookie and header your server trusts for authorization decisions. Anything the client can set that the server uses to gate access must be signed, verified, and scoped.
  4. Grep your codebase for impersonation, "sudo", "assume role", or support-tool features. Treat them as security-critical even when they live behind internal URLs. At sufficient scale there is no internal.
  5. Open the JavaScript bundle your site ships to the browser. Whatever you see there, an attacker sees too. Anything sensitive in the bundle is already leaked.
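
For the first item, the database can audit itself. A sketch of two catalog queries that surface the usual gaps:

    -- Tables with RLS switched on but not a single policy defined: deny-all today,
    -- and one careless permissive policy away from being wide open.
    select t.schemaname, t.tablename
    from pg_tables t
    where t.rowsecurity
      and not exists (
        select 1
        from pg_policies p
        where p.schemaname = t.schemaname
          and p.tablename  = t.tablename
      )
    order by 1, 2;

    -- The more dangerous case: tables in the API-exposed schema with RLS disabled outright.
    select schemaname, tablename
    from pg_tables
    where schemaname = 'public'
      and not rowsecurity;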

None of these are exotic. All of them are the kinds of issues that ship in production weekly, and all of them are findable in an afternoon if you know where to look.

Why this matters

Six critical findings on a production fintech app is the visible result. The interesting part is the cost curve behind it.

A traditional pentest on a target this size runs €15k to €25k and books two to three weeks of calendar time. The engagement described here took four working days and spent the expensive human hours on the parts of the work where human judgment is actually load-bearing (prioritization, validation, disclosure). The mechanical parts, the ones that used to eat half the engagement, compressed to minutes. For a small SaaS shipping its first production release, that is the difference between getting a security review and skipping it.

The limits are real. Unsupervised agents will break production, miss business-context bugs, and generate confident nonsense at a rate that makes unreviewed output worse than useless. Full automation is not close. But the shape of the job is changing, and anyone doing security work on small and mid-sized applications should be running this kind of workflow now rather than waiting for the next tool release.

If you want to see what your own application looks like from the outside, we run free recon demos. It is the single highest-leverage thing you can do for your security posture this quarter.


All findings were responsibly disclosed, remediated, and verified fixed before publication. Publication was coordinated with the target's consent.