Hero image generated by ChatGPT
This is a personal blog and all content herein is my personal opinion and not that of my employer.
This is Part 3 of a 3-part series.
| Part | Title |
|---|---|
| Part 1 | The Leak, the Context, and the Framework |
| Part 2 | Mapping the Trust Boundaries and the Attack Tree |
| Part 3 (this post) | Defending Against Runtime Abuse |
What defenders should actually look for
If you are defending a system like this, or buying one, or integrating one into enterprise workflows, a decent starting point is to stop treating it like a magical box and start asking very ordinary questions.
Watch the credential minting edge
The bridge or equivalent issuance path deserves special scrutiny:
- Does one identity mint one clearly scoped runtime capability?
- Are initial issuance and refresh governed identically?
- Are session ownership checks airtight?
- Are stale tokens invalidated on every meaningful transition?
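To make the first two questions concrete, here is a minimal sketch of an issuance path where initial minting and refresh are forced through the same checks, so the two can never drift apart. All names (`Bridge`, `SessionToken`, the scope strings) are invented for illustration and do not come from any real product.

```python
import secrets
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionToken:
    session_id: str
    scope: str          # one clearly scoped runtime capability
    expires_at: float

class Bridge:
    def __init__(self):
        self._sessions = {}  # session_id -> owning user identity

    def _issue(self, user_id: str, session_id: str, scope: str) -> SessionToken:
        # The ONLY minting path. Both initial issuance and refresh land
        # here, so session ownership and scope are governed identically.
        if self._sessions.get(session_id) != user_id:
            raise PermissionError("session not owned by this identity")
        return SessionToken(session_id, scope, time.time() + 300)

    def mint(self, user_id: str, scope: str) -> SessionToken:
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = user_id
        return self._issue(user_id, session_id, scope)

    def refresh(self, user_id: str, old: SessionToken) -> SessionToken:
        # Refresh re-runs the same ownership and scope checks as initial
        # issuance; it can never widen scope or skip the ownership test.
        return self._issue(user_id, old.session_id, old.scope)
```

The point of funnelling everything through one private `_issue` method is that there is no second code path for an attacker to find where refresh is validated more loosely than first issuance.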
Treat capability tokens like crown jewels
If a session ingress token can append, retrieve, or otherwise influence session state, then it deserves the same paranoia you would apply to any bearer capability with real operational value: short lifetime, narrow scope, explicit invalidation, careful storage, zero logging tolerance.
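Those properties are cheap to enforce mechanically. A hedged sketch, with all names hypothetical: a token guard that checks expiry and an explicit revocation set, plus a `repr` that refuses to leak the bearer secret into logs.

```python
import time
from dataclasses import dataclass

@dataclass
class CapabilityToken:
    secret: str
    scope: str
    expires_at: float

    def __repr__(self) -> str:
        # Zero logging tolerance: the bearer secret never appears in
        # debug output or log lines, even accidentally.
        return f"CapabilityToken(scope={self.scope!r}, secret=<redacted>)"

class TokenGuard:
    def __init__(self):
        self._revoked: set[str] = set()

    def revoke(self, token: CapabilityToken) -> None:
        # Explicit invalidation, e.g. on logout or session transition.
        self._revoked.add(token.secret)

    def is_valid(self, token: CapabilityToken, required_scope: str) -> bool:
        return (
            token.secret not in self._revoked
            and token.scope == required_scope   # narrow scope, exact match
            and time.time() < token.expires_at  # short lifetime
        )
```

None of this is novel; it is the same discipline you would apply to any bearer credential with operational value, applied to session ingress tokens.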
Separate control from data when you can
A shared transport for outputs and control messages may be convenient. It is also a coupling decision with security consequences. The more those semantics are separated, the easier provenance, auditing, and failure analysis become.
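If full separation is not feasible, the minimum viable version is to tag every message with its plane and refuse control semantics arriving over the data path. A sketch under assumptions; the message types here are invented for illustration, not drawn from any real protocol.

```python
from enum import Enum

class Plane(Enum):
    DATA = "data"
    CONTROL = "control"

# Message types that carry control semantics (illustrative only).
CONTROL_TYPES = {"mode_switch", "permission_change", "session_rebind"}

def dispatch(msg_type: str, channel: Plane) -> str:
    """Route a message, rejecting control semantics on the data channel."""
    is_control = msg_type in CONTROL_TYPES
    if is_control and channel is not Plane.CONTROL:
        # Provenance failure: a control message arrived over the data
        # path, so its origin cannot be trusted. Fail loudly.
        raise ValueError(f"control message {msg_type!r} on data channel")
    return "control" if is_control else "data"
```

A hard allowlist like this also gives you a natural audit point: every rejected dispatch is exactly the kind of state weirdness worth alerting on.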
Instrument state anomalies
A lot of interesting abuse in systems like this will not look like an obviously malicious prompt. It will look like state weirdness:
- repeated bridge refreshes
- out-of-order message or sequence patterns
- mismatched epoch or version usage
- control changes appearing at unusual times
- mode switches that do not line up with user action
- duplicate operation IDs
- stale trusted-device state after account transition
That is detection material.
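Two of those signals are simple enough to sketch directly. The following is an illustrative pass over a runtime event log that surfaces duplicate operation IDs and an unusually high bridge-refresh rate; the event shape and threshold are assumptions, not a real schema.

```python
from collections import Counter

def find_anomalies(events: list[dict], refresh_threshold: int = 3) -> list[str]:
    """Flag duplicate op IDs and sessions with excessive bridge refreshes."""
    findings = []

    # Duplicate operation IDs: the same op should never execute twice.
    op_counts = Counter(e["op_id"] for e in events if "op_id" in e)
    for op_id, n in op_counts.items():
        if n > 1:
            findings.append(f"duplicate op_id {op_id} seen {n} times")

    # Repeated bridge refreshes within one observation window.
    refreshes = Counter(
        e["session"] for e in events if e.get("type") == "bridge_refresh"
    )
    for session, n in refreshes.items():
        if n >= refresh_threshold:
            findings.append(f"session {session}: {n} bridge refreshes")

    return findings
```

The other signals in the list (epoch mismatches, unexplained mode switches, stale trusted-device state) follow the same pattern: define the expected invariant, then alert when the event stream violates it.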
Be honest about “dangerous by design” paths
If there is a direct-connect mode, a skip-permission flag, or any local-development bypass path, treat it as what it is: an escape hatch. Escape hatches are not necessarily vulnerabilities, but they are still part of the risk surface - especially once products escape the lab and land in messy real-world environments.
What a mature vendor response should look like
If a vendor ships part of its runtime like this, the response should not just be “the source map has been removed.” That is the minimum hygiene fix. The real question is what the vendor does with the accidental transparency.
A mature response would include:
- a packaging and release review for artefact hygiene
- a hard look at bridge/session issuance boundaries
- re-validation of refresh and recovery equivalence
- explicit review of capability-token scope and invalidation
- review of any trusted-device or contextual auth layering
- validation that control-plane message provenance is strong
- a threat model update that treats the runtime as a distributed security boundary, not a chatbot with plugins
In other words: do not only fix the leak. Fix what the leak taught everyone about your system.
Why this leak was strategically useful
There is a slightly uncomfortable truth here for the industry. Leaks like this are valuable because they cut through the marketing abstraction. They show what the product really is.
And what this leak showed was not “wow, clever prompts.” It showed a runtime architecture where the model is only one component, and not necessarily the one with the most interesting security properties.
The model is the glamorous bit. The harness is the consequential bit.
That matters because the industry is gradually standardising around the harness pattern:
- user identity at the front
- token exchange for runtime authority
- remote session or bridge semantics
- local tool execution or mediation
- stateful memory and session continuation
- enterprise policy layered in somewhere, often awkwardly
Once you see that pattern clearly, you stop asking only “can the model be injected?” and start asking “where does capability emerge, how does it move, and what can it steer?”
That is a far better question.
Final conclusions
Starting from the leak itself was the right call. It avoided the usual trap of anchoring on whatever detail the first round of commentary found most meme-worthy.
The three-phase analysis led to a few firm conclusions.
First, this was not primarily a secrets story. The absence of obvious embedded credentials did not make the leak unimportant. It made it more architectural.
Second, the critical seam in a modern agent runtime is the translation from user identity into runtime capability. The bridge/session issuance path looked like the most important security boundary.
Third, once the runtime has session-scoped capability and a transport that carries control semantics, the impact of any upstream confusion rises sharply. A broken prompt filter and a confused control plane are not in the same league.
Fourth, trusted-device or contextual auth layering deserves special attention. Those controls often fail not in the happy path but in the transitions: account switching, re-enrolment, recovery, and “optional-until-it-isn’t” enforcement.
Fifth, the broader lesson is not about Anthropic alone. It is about the direction of the whole space. Agent systems are becoming distributed identity-and-capability systems. That means they will inherit the failure modes of distributed identity-and-capability systems.
Which brings us back to the title.
The model isn’t the risk.
Or at least, not the only one, and often not the most interesting one.
The harness is.
And the harness is just a distributed system pretending not to be.
Appendix: condensed attack tree
For reference, here is a condensed version of the attack tree - the part that is most reusable when comparing other agent runtimes. The full annotated tree, including mitigations at each node, is in Part 2.
Appendix: practical questions to ask any agent vendor
- How is user identity converted into runtime capability?
- What exact tokens exist after login, and what can each one do?
- Are session-scoped tokens separate for read and write actions?
- Is refresh behaviour identical to initial issuance?
- What binds transport channels to session, user, and device state?
- Can the transport carry control-plane messages?
- What local runtime behaviours can remote messages alter?
- How is contextual or device trust enforced across initial login, refresh, reconnect, and recovery?
- What direct-connect or bypass modes exist, and how are they contained?
- What telemetry exists for detecting state weirdness rather than just malicious prompts?
If a vendor cannot answer those cleanly, they may not fully understand the system they are selling you. That should concern you more than whether the demo can plan your weekend.
Appendix: why this changes how I think about AI security research
One reason I wanted to document the process, not just the conclusion, is that it exposed a trap that is becoming increasingly common in this space. When an AI security story breaks, the first wave of attention usually locks onto the part that is easiest to screenshot: the odd prompt, the hidden flag, the persona leak, the weird internal codename, the spicy system message, the “look what it said when I asked nicely” moment. Those things can be interesting. They are also dangerously capable of consuming the whole conversation.
The leak made it possible to do something better: start with implementation reality instead of spectacle. That changes the quality of the questions. Instead of asking “what hidden behaviours are there?”, you start asking “what shape of authority does the system create?” Instead of “what else can the model do?”, you ask “what else can the runtime do if a token is replayed, a session is rebound, a transport is confused, or a policy layer drifts?”
It also reinforces a broader lesson that applies well beyond agent products. Security research is often portrayed as a hunt for the clever trick. Sometimes it is. More often, it is a disciplined reduction of uncertainty. The leak reduced uncertainty about how the runtime was assembled. The three-phase method reduced uncertainty further by moving from artefacts, to boundaries, to abuse. The attack tree then turned that into a structured model that can be compared across vendors and products.
The other reason this matters is that agent systems are going to be absorbed into normal enterprise control planes whether the industry is ready or not. They will show up in developer tooling, security tooling, identity tooling, productivity tooling, and workflow tooling. The conversation therefore needs to graduate quickly from “can someone prompt-inject this thing?” to “how does this system mint, bind, propagate, constrain, and observe authority?” That is the level where architecture, assurance, and procurement decisions get made.
So the practical takeaway from this series is not “every agent runtime is broken.” It is “every agent runtime deserves to be analysed like a runtime, not like a novelty chatbot.” That means reading the harness as seriously as you would read any auth-heavy distributed service. It means tracing tokens, state, and control messages with the same care you would apply to an identity provider, a broker, or a remote management plane. And it means being willing to say, out loud, that the most interesting part of an AI system may be the part least likely to appear in the launch blog.
That, more than anything else, is what this leak clarified for me.
AI did not invent new security failures. It mostly rebranded old ones, composed them in novel ways, and wrapped them in enough new terminology that people briefly forgot how to recognise them.
The answer is not panic. It is systems thinking.
This analysis has shifted how I view the AI security landscape. We have spent so much time worrying about what an agent might say that we have ignored how it is being told what to do.
When you look at your own internal AI projects, are you spending more time hardening the prompt, or hardening the bridge that authorises the action?
As ever, thanks for reading and feel free to leave comments down below!