Hero image generated by ChatGPT
This is a personal blog and all content herein is my personal opinion and not that of my employer.
This is Part 3 of a 3-part series.
| Part | Title |
|---|---|
| Part 1 | The Leak, the Context, and the Framework |
| Part 2 | Mapping the Trust Boundaries and the Attack Tree |
| Part 3 (this post) | Defending Against Runtime Abuse |
What defenders should actually look for
If you are defending a system like this, or buying one, or integrating one into enterprise workflows, a decent starting point is to stop treating it like a magical box and start asking very ordinary questions.
Watch the credential minting edge
The bridge or equivalent issuance path deserves special scrutiny:
- Does one identity mint one clearly scoped runtime capability?
- Are initial issuance and refresh governed identically?
- Are session ownership checks airtight?
- Are stale tokens invalidated on every meaningful transition?
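To make the first two questions concrete, here is a minimal sketch of an issuance path where initial minting and refresh are forced through the same checks, so the two can never drift apart. All names (`Bridge`, `SessionToken`, the scope strings) are invented for illustration and do not come from any real product.

```python
import secrets
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionToken:
    session_id: str
    scope: str          # one clearly scoped runtime capability
    expires_at: float

class Bridge:
    def __init__(self):
        self._sessions = {}  # session_id -> owning user identity

    def _issue(self, user_id: str, session_id: str, scope: str) -> SessionToken:
        # The ONLY minting path. Both initial issuance and refresh land
        # here, so session ownership and scope are governed identically.
        if self._sessions.get(session_id) != user_id:
            raise PermissionError("session not owned by this identity")
        return SessionToken(session_id, scope, time.time() + 300)

    def mint(self, user_id: str, scope: str) -> SessionToken:
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = user_id
        return self._issue(user_id, session_id, scope)

    def refresh(self, user_id: str, old: SessionToken) -> SessionToken:
        # Refresh re-runs the same ownership and scope checks as initial
        # issuance; it can never widen scope or skip the ownership test.
        return self._issue(user_id, old.session_id, old.scope)
```

The point of funnelling everything through one private `_issue` method is that there is no second code path for an attacker to find where refresh is validated more loosely than first issuance.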
Treat capability tokens like crown jewels
If a session ingress token can append, retrieve, or otherwise influence session state, then it deserves the same paranoia you would apply to any bearer capability with real operational value: short lifetime, narrow scope, explicit invalidation, careful storage, zero logging tolerance.
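Those properties are cheap to enforce mechanically. A hedged sketch, with all names hypothetical: a token guard that checks expiry and an explicit revocation set, plus a `repr` that refuses to leak the bearer secret into logs.

```python
import time
from dataclasses import dataclass

@dataclass
class CapabilityToken:
    secret: str
    scope: str
    expires_at: float

    def __repr__(self) -> str:
        # Zero logging tolerance: the bearer secret never appears in
        # debug output or log lines, even accidentally.
        return f"CapabilityToken(scope={self.scope!r}, secret=<redacted>)"

class TokenGuard:
    def __init__(self):
        self._revoked: set[str] = set()

    def revoke(self, token: CapabilityToken) -> None:
        # Explicit invalidation, e.g. on logout or session transition.
        self._revoked.add(token.secret)

    def is_valid(self, token: CapabilityToken, required_scope: str) -> bool:
        return (
            token.secret not in self._revoked
            and token.scope == required_scope   # narrow scope, exact match
            and time.time() < token.expires_at  # short lifetime
        )
```

None of this is novel; it is the same discipline you would apply to any bearer credential with operational value, applied to session ingress tokens.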
Separate control from data when you can
A shared transport for outputs and control messages may be convenient. It is also a coupling decision with security consequences. The more those semantics are separated, the easier provenance, auditing, and failure analysis become.
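If full separation is not feasible, the minimum viable version is to tag every message with its plane and refuse control semantics arriving over the data path. A sketch under assumptions; the message types here are invented for illustration, not drawn from any real protocol.

```python
from enum import Enum

class Plane(Enum):
    DATA = "data"
    CONTROL = "control"

# Message types that carry control semantics (illustrative only).
CONTROL_TYPES = {"mode_switch", "permission_change", "session_rebind"}

def dispatch(msg_type: str, channel: Plane) -> str:
    """Route a message, rejecting control semantics on the data channel."""
    is_control = msg_type in CONTROL_TYPES
    if is_control and channel is not Plane.CONTROL:
        # Provenance failure: a control message arrived over the data
        # path, so its origin cannot be trusted. Fail loudly.
        raise ValueError(f"control message {msg_type!r} on data channel")
    return "control" if is_control else "data"
```

A hard allowlist like this also gives you a natural audit point: every rejected dispatch is exactly the kind of state weirdness worth alerting on.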
Instrument state anomalies
A lot of interesting abuse in systems like this will not look like an obviously malicious prompt. It will look like state weirdness:
- repeated bridge refreshes
- out-of-order message or sequence patterns
- mismatched epoch or version usage
- control changes appearing at unusual times
- mode switches that do not line up with user action
- duplicate operation IDs
- stale trusted-device state after account transition
That is detection material.
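Two of those signals are simple enough to sketch directly. The following is an illustrative pass over a runtime event log that surfaces duplicate operation IDs and an unusually high bridge-refresh rate; the event shape and threshold are assumptions, not a real schema.

```python
from collections import Counter

def find_anomalies(events: list[dict], refresh_threshold: int = 3) -> list[str]:
    """Flag duplicate op IDs and sessions with excessive bridge refreshes."""
    findings = []

    # Duplicate operation IDs: the same op should never execute twice.
    op_counts = Counter(e["op_id"] for e in events if "op_id" in e)
    for op_id, n in op_counts.items():
        if n > 1:
            findings.append(f"duplicate op_id {op_id} seen {n} times")

    # Repeated bridge refreshes within one observation window.
    refreshes = Counter(
        e["session"] for e in events if e.get("type") == "bridge_refresh"
    )
    for session, n in refreshes.items():
        if n >= refresh_threshold:
            findings.append(f"session {session}: {n} bridge refreshes")

    return findings
```

The other signals in the list (epoch mismatches, unexplained mode switches, stale trusted-device state) follow the same pattern: define the expected invariant, then alert when the event stream violates it.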
Be honest about “dangerous by design” paths
If there is a direct-connect mode, a skip-permission flag, or any local-development bypass path, treat it as what it is: an escape hatch. Escape hatches are not necessarily vulnerabilities, but they are still part of the risk surface - especially once products escape the lab and land in messy real-world environments.
What a mature vendor response should look like
If a vendor ships part of its runtime like this, the response should not just be “the source map has been removed.” That is the minimum hygiene fix. The real question is what the vendor does with the accidental transparency.
A mature response would include:
- a packaging and release review for artefact hygiene
- a hard look at bridge/session issuance boundaries
- re-validation of refresh and recovery equivalence
- explicit review of capability-token scope and invalidation
- review of any trusted-device or contextual auth layering
- validation that control-plane message provenance is strong
- a threat model update that treats the runtime as a distributed security boundary, not a chatbot with plugins
In other words: do not only fix the leak. Fix what the leak taught everyone about your system.
Why this leak was strategically useful
There is a slightly uncomfortable truth here for the industry. Leaks like this are valuable because they cut through the marketing abstraction. They show what the product really is.
And what this leak showed was not “wow, clever prompts.” It showed a runtime architecture where the model is only one component, and not necessarily the one with the most interesting security properties.
The model is the glamorous bit. The harness is the consequential bit.
That matters because the industry is gradually standardising around the harness pattern:
- user identity at the front
- token exchange for runtime authority
- remote session or bridge semantics
- local tool execution or mediation
- stateful memory and session continuation
- enterprise policy layered in somewhere, often awkwardly
Once you see that pattern clearly, you stop asking only “can the model be injected?” and start asking “where does capability emerge, how does it move, and what can it steer?”
That is a far better question.
Final conclusions
Starting from the leak itself was the right call. It avoided the usual trap of anchoring on whatever detail the first round of commentary found most meme-worthy.
The three-phase analysis led to a few firm conclusions.
First, this was not primarily a secrets story. The absence of obvious embedded credentials did not make the leak unimportant. It made it more architectural.
Second, the critical seam in a modern agent runtime is the translation from user identity into runtime capability. The bridge/session issuance path looked like the most important security boundary.
Third, once the runtime has session-scoped capability and a transport that carries control semantics, the impact of any upstream confusion rises sharply. A broken prompt filter and a confused control plane are not in the same league.
Fourth, trusted-device or contextual auth layering deserves special attention. Those controls often fail not in the happy path but in the transitions: account switching, re-enrolment, recovery, and “optional-until-it-isn’t” enforcement.
Fifth, the broader lesson is not about Anthropic alone. It is about the direction of the whole space. Agent systems are becoming distributed identity-and-capability systems. That means they will inherit the failure modes of distributed identity-and-capability systems.
Which brings us back to the title.
The model isn’t the risk.
Or at least, not the only one, and often not the most interesting one.
The harness is.
And the harness is just a distributed system pretending not to be.
Appendix: condensed attack tree
For reference, here is a condensed version of the attack tree - the part that is most reusable when comparing other agent runtimes. The full annotated tree, including mitigations at each node, is in Part 2.
Appendix: practical questions to ask any agent vendor
- How is user identity converted into runtime capability?
- What exact tokens exist after login, and what can each one do?
- Are session-scoped tokens separate for read and write actions?
- Is refresh behaviour identical to initial issuance?
- What binds transport channels to session, user, and device state?
- Can the transport carry control-plane messages?
- What local runtime behaviours can remote messages alter?
- How is contextual or device trust enforced across initial login, refresh, reconnect, and recovery?
- What direct-connect or bypass modes exist, and how are they contained?
- What telemetry exists for detecting state weirdness rather than just malicious prompts?
If a vendor cannot answer those cleanly, they may not fully understand the system they are selling you. That should concern you more than whether the demo can plan your weekend.
Appendix: why this changes how I think about AI security research
One reason I wanted to document the process, not just the conclusion, is that it exposed a trap that is becoming increasingly common in this space. When an AI security story breaks, the first wave of attention usually locks onto the part that is easiest to screenshot: the odd prompt, the hidden flag, the persona leak, the weird internal codename, the spicy system message, the “look what it said when I asked nicely” moment. Those things can be interesting. They are also dangerously capable of consuming the whole conversation.
The leak made it possible to do something better: start with implementation reality instead of spectacle. That changes the quality of the questions. Instead of asking “what hidden behaviours are there?”, you start asking “what shape of authority does the system create?” Instead of “what else can the model do?”, you ask “what else can the runtime do if a token is replayed, a session is rebound, a transport is confused, or a policy layer drifts?”
It also reinforces a broader lesson that applies well beyond agent products. Security research is often portrayed as a hunt for the clever trick. Sometimes it is. More often, it is a disciplined reduction of uncertainty. The leak reduced uncertainty about how the runtime was assembled. The three-phase method reduced uncertainty further by moving from artefacts, to boundaries, to abuse. The attack tree then turned that into a structured model that can be compared across vendors and products.
The other reason this matters is that agent systems are going to be absorbed into normal enterprise control planes whether the industry is ready or not. They will show up in developer tooling, security tooling, identity tooling, productivity tooling, and workflow tooling. The conversation therefore needs to graduate quickly from “can someone prompt-inject this thing?” to “how does this system mint, bind, propagate, constrain, and observe authority?” That is the level where architecture, assurance, and procurement decisions get made.
So the practical takeaway from this series is not “every agent runtime is broken.” It is “every agent runtime deserves to be analysed like a runtime, not like a novelty chatbot.” That means reading the harness as seriously as you would read any auth-heavy distributed service. It means tracing tokens, state, and control messages with the same care you would apply to an identity provider, a broker, or a remote management plane. And it means being willing to say, out loud, that the most interesting part of an AI system may be the part least likely to appear in the launch blog.
That, more than anything else, is what this leak clarified for me.
AI did not invent new security failures. It mostly rebranded old ones, composed them in novel ways, and wrapped them in enough new terminology that people briefly forgot how to recognise them.
The answer is not panic. It is systems thinking.
This analysis has shifted how I view the AI security landscape. We have spent so much time worrying about what an agent might say that we have ignored how it is being told what to do.
When you look at your own internal AI projects, are you spending more time hardening the prompt, or hardening the bridge that authorises the action?
As ever, thanks for reading and feel free to leave comments down below!