A few weeks ago I wrote about what I was then calling the Agile Agentic Development Life Cycle, or AADLC. At the time it was mostly a survival pattern: a way of keeping AI-assisted coding from decaying into one very long, very confident, increasingly haunted conversation where every compromise, half-fix, discarded idea and temporary scaffold remained in the model’s working memory as if it were a sacred architectural decision handed down by the ancients.
The original version of AADLC came from a practical place. I was splitting planning from execution, keeping PRs small, resetting polluted context, using different models for different roles, and trying to stop coding agents from doing the classic “while I’m here” refactor goblin routine. It worked, or at least it worked well enough to make the workflow feel less like negotiating with a charming intern who had swallowed Stack Overflow and more like actually engineering software with tools that happened to be probabilistic.
That was useful and it still is. But it turns out that was only the first layer, because what I had described as an agentic development lifecycle was really the beginning of something bigger, weirder, and much more invoice-shaped:
FinOps for delegated cognition.
Or, less politely:
Why did asking an AI to change one thing apparently cause a small distributed systems incident inside my bill?
Welcome to AADLCv2. Same goblin. More instrumentation.
The billing model changed, and so did the problem
The old mental model for AI coding assistants was simple enough to be dangerous. You sent a prompt. You got an answer. Maybe you used one premium request. Maybe you used a few. Maybe you cursed at the model for confidently importing a package that had not existed since approximately the Bronze Age, but the unit still felt visible. Prompt in, response out, count the request.
That world is ending.
GitHub Copilot’s move toward usage-based AI Credits changes the shape of the problem (it’s not the only AI platform doing so, but it’s the one that’s currently having the most impact on… ME!). The cost is no longer represented cleanly by the number of visible prompts or premium requests. The cost is increasingly tied to the total amount of work delegated to the agent: tokens in, tokens out, cached tokens, repository traversal, file enumeration, context hydration, planning, replanning, tool calls, validation loops, retry loops, model-specific behaviours, and all the other machinery hiding beneath the polite little chat bubble.
That is the first important inversion. The old world taught us to count visible interactions. The new world charges us for hidden cognition. And hidden cognition is where the chaos goblin lives.

A single prompt can look tiny. “Add a fetch command.” “Update the README.” “Refactor this helper.” “Make the tests pass.” Cute. Harmless. Practically wearing little shoes. Except the agent does not experience that as a tiny instruction; it experiences it as the start of a hunt. It may read the README, then the package metadata, then the tests, then the existing CLI, then the parser, then the diff engine, then the GitHub instructions, then the AADLC memory, then the current PR contract, then the trust boundaries. It may then plan the work, notice an ambiguity, search again, replan, inspect the tests, modify one file, run the tests, fail, inspect more files, patch another thing, run again, explain itself, and produce a summary.
You saw one prompt. The billing system saw a cognitive execution graph.
That is not inherently bad. In fact, it is the whole reason coding agents are useful. They can do the boring connective tissue work that human engineers often dislike, especially when the repository is unfamiliar. The problem is what happens when that connective tissue work becomes repeated excavation. If the agent has to rediscover the repository from scratch every time, that useful work becomes a tax.
I have started calling that tax semantic hydration.

Semantic hydration: making the agent rediscover gravity
Semantic hydration is the cost of making an agent understand enough of a project to act safely. On a greenfield project, a revived dormant repo, or any repository where the agent has no durable memory, the model must infer what the project is for, what it is not for, where the important files live, which patterns already exist, which tests prove which behaviours, which boundaries matter, which assumptions are safe, which assumptions have teeth, which constraints are durable, and which constraints were only true for a previous PR.
That inference is expensive in several different ways. Sometimes it is economically expensive. Sometimes it is cognitively expensive. Sometimes it is expensive in the “why has this agent decided to rewrite half the repo because it misunderstood a temporary workaround as the architecture” sense, which is a very particular flavour of pain and should probably have its own ICD code.
The problem is not that the agent is stupid (it is and it isn’t). The problem is that every session begins with amnesia unless we give it durable structure.
An agent without durable memory must rediscover gravity every session. That is amusing as a phrase. It is less amusing when gravity is hidden inside your usage-based bill.
This is where the framing shifts from prompt optimisation to cognition governance. Prompt quality still matters, obviously. Be specific. Provide examples. Give constraints. Ask for tests. Tell it not to hallucinate, as if models have a checkbox labelled “lie for sport” and we had simply forgotten to untick it. But prompt optimisation alone is no longer enough, because when you are working with autonomous or semi-autonomous coding agents, the prompt is not the whole system. The prompt is an entry point into a larger execution environment.
The better question is no longer merely:
How do I write a better prompt?
The better question is:
How do I constrain the reasoning surface before expensive execution begins?
That is the real AADLCv2 shift. We are not just prompting text generators. We are governing delegated cognition. That sounds grandiose, so let me translate it into normal engineering language: stop letting the agent wander the building with a clipboard and a vague sense of destiny. Give it a map. Give it a task. Give it a contract. Give it stop conditions. Give it tests. Give it a cache of durable truth. Tell it what doors it is not allowed to open without asking.
This is not about making agents less capable. It is about making their capability usable.
The instructions framework is the control surface, not a magic forcefield
The practical foundation for this is my coding-agent-baselines repository, which started as a set of reusable GitHub Copilot coding-agent instruction packs and has now grown into the home of the AADLCv2 governance artefacts. That repo contains the root Copilot operating model, modular instruction packs, and now .github/aadlc/ artefacts for memory, invariants, trust boundaries, tool policy, PR contracts, repo maps, and prompt-as-code plans.
The important thing to understand is that this works by leveraging GitHub Copilot’s instructions framework. I am not patching Copilot itself. I am not adding a sandbox. I am not magically removing capabilities from the agent runtime. I am using the instruction-loading mechanism that Copilot already provides to put a governed operating model in front of the agent before it starts work.
That distinction matters, because it is very easy to overclaim here. When I say “the agent cannot call that tool” or “the agent must not edit that file”, I do not mean I have empirically proven that the underlying platform has made that action cryptographically impossible. I mean the repo now carries explicit behavioural guardrails, contracts, invariants, and validation checks that make the desired behaviour more probable, more reviewable, and easier to detect when violated.
That is not the same as hard enforcement. It is closer to policy-as-instructions plus tests-as-tripwires. The instructions establish the rule. The PR contract defines the permitted boundary. The tool policy classifies the action. The validation checks confirm the diff stayed within the expected file set. The review then has something concrete to compare the agent against. That is a very different posture from “trust me bro, the agent seemed confident.”
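The file-set tripwire described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the baseline repository: the function name `out_of_scope`, the `APPROVED_PATTERNS` list, and the sample paths are all hypothetical, and in practice the list of changed files would probably come from something like `git diff --name-only`.

```python
from fnmatch import fnmatch

# Hypothetical approved scope for one PR. These patterns are illustrative only,
# they are not taken from any real PR contract.
APPROVED_PATTERNS = ["src/*", "tests/*", "README.md", ".github/aadlc/*"]


def out_of_scope(changed_files, approved_patterns):
    """Return the changed paths that match none of the approved patterns.

    Note: fnmatch's "*" also matches "/" characters, so "src/*" covers
    nested paths such as "src/pkg/module.py".
    """
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in approved_patterns)
    ]


# Example: the workflow file is outside the approved file set, so it is flagged.
changed = ["src/cli.py", "tests/test_cli.py", ".github/workflows/ci.yml"]
violations = out_of_scope(changed, APPROVED_PATTERNS)
print(violations)
```

A check like this does not stop the agent from touching the workflow file. It only makes the touch impossible to miss at review time, which is the whole point of a tripwire.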
This is why I think of AADLCv2 as constrained probabilism rather than deterministic sandboxing. We are not turning the model into a conventional program. We are narrowing the probability space, externalising the rules, and adding enough tripwires that the most likely path is the safe and intended one. If the agent violates that path, we want the violation to be visible immediately, not discovered three PRs later when some deeply helpful “improvement” has quietly turned the repo into a tiny platform nobody asked for.
Security people should find this familiar. A security control does not have to be perfect to be valuable, but it does need to be honest about what it is. AADLCv2 instructions are not a kernel boundary. They are a governance boundary, backed by review, tests, file-set checks, contracts, and increasingly prompt-as-code artefacts. That is still useful. In fact, it is the only reason this approach works at all today.
AADLCv2: the control loop
The updated AADLCv2 loop is simple enough to remember, which is important because a methodology that requires a laminated altar card and three consultants to explain it has already lost. It has five phases: shaping, planning, bounded execution, deterministic validation, and context reset.
Shaping is where you turn “do the thing” into an actual task. You clarify intent, identify boundaries, surface assumptions, and decide whether the work is trivial enough for a direct change or substantial enough to require a plan artefact. Shaping is cheap compared to fixing a model that has gone marching confidently in the wrong direction for twenty minutes. This is where a generalist model is often useful. I do not need a coding agent to tell me whether I am asking the right question; I need something that can help sharpen the question before I hand it to something with tools. This is the “measure twice, let the robot arm cut once” phase.

Planning produces the contract. Not a vibe, not a paragraph of hope, but a contract. The plan should define the goal, non-goals, approved scope, forbidden scope, files expected to change, trust boundaries, invariants to preserve, test strategy, stop conditions, escalation triggers, model fallback strategy, and correction budget. The plan is what makes later review possible. Without it, you are reviewing the agent against whatever it decided the task meant after swimming through a soup of repo context and statistical enthusiasm. With it, you can ask a much better question: did the agent do what the contract said?
Bounded execution is where the coding agent earns its keep. It reads the plan, memory, invariants, and PR contract, then implements only the approved scope. It does not decide that now would be a spiritually fulfilling time to rename the entire CLI, introduce a database, replace the test framework, and add a web UI because “future extensibility”. Future extensibility is where many small tools go to die wearing a tiny enterprise architecture hat. If the box is wrong, stop and amend the contract. Do not silently expand the box.
Deterministic validation is where fluent confidence goes to be humbled. Tests run. Fixtures parse. Commands execute. JSON validates. YAML loads. Network calls are mocked if the contract says they are mocked. If live network access is forbidden, a test should fail if anything tries it. The agent does not get to say “this should work” and leave a glitter trail of optimism across the repo; it either proves the change or reports what could not be proven.
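One way to make "a test should fail if anything tries it" concrete is to patch the socket layer during the test run. The sketch below is an assumption about how such a guard could look, not the guard used in the field test: `NetworkBlockedError`, `no_network`, and `guard_trips` are hypothetical names, and it only intercepts `socket.socket.connect`, so DNS lookups and non-socket transports would need their own handling.

```python
import socket
from unittest import mock


class NetworkBlockedError(RuntimeError):
    """Raised when code attempts a live connection while the guard is active."""


def _refuse(*args, **kwargs):
    # Replacement for socket.socket.connect while the guard is active.
    raise NetworkBlockedError("live network access is forbidden by the contract")


def no_network():
    """Return a context manager that makes every socket connect fail loudly."""
    return mock.patch.object(socket.socket, "connect", _refuse)


def guard_trips():
    """Demonstrate the guard: an attempted connection raises instead of connecting."""
    with no_network():
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect(("127.0.0.1", 9))
        except NetworkBlockedError:
            return True
        finally:
            sock.close()
    return False
```

Wrapped around a test session, a guard like this turns "the tests are offline" from a promise into something that fails noisily when the promise is broken.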

Context reset is the part most people skip, then wonder why the next agent session smells faintly of old assumptions. At the end of a PR, you close the loop. You update durable memory only with stable truths. You record new invariants if they genuinely exist. You remove temporary plan files or move them into the durable plan directory if they are worth keeping. You mark old PR constraints as closed, superseded, or historical. You do not shovel every line of session chatter into memory like a raccoon building a nest out of receipts. Memory is not a diary. Memory is compressed architectural truth.
Cognitive caching: store truth, not chatter
The most practical part of AADLCv2 is probably the least glamorous. Create files. Boring, wonderful files.
.github/aadlc/memory.md
.github/aadlc/current-pr-contract.md
.github/aadlc/invariants.yml
.github/aadlc/trust-boundaries.md
.github/aadlc/tool-policy.yml
.github/aadlc/repo-map.json
.github/aadlc/plans/prN-short-description.md
These files are not magic. That is the point. They are plain-text governance artefacts that make project truth durable and reviewable. A good memory.md contains project purpose, non-goals, architecture summary, core invariants, trust boundaries, known sharp edges, canonical validation commands, operating assumptions, and open questions. A bad memory.md contains everything the agent did today, three obsolete plans, four emotional support assumptions, and a note that “tests were weird” with no explanation.
That is not memory. That is sediment.
The aim is not to maximise context. The aim is to preserve the right context. This is where AADLCv2 starts to look a lot less like prompt engineering and a lot more like infrastructure engineering, because we are building state. We are deciding what is authoritative, what expires, what carries forward, and what needs validation before use.
That is governance.
The PR contract problem
One of the most useful field findings came from a small project called m365-graph-schema-monitor, which revived an old idea: monitoring Microsoft Graph schema changes. The first PR was deliberately constrained. It could parse local CSDL/XML snapshots, diff schema properties, expose inspect and diff commands, use hand-authored fixtures, and run tests. It explicitly could not add live Microsoft Graph fetch, authentication, tenant access, a scheduler, a database, a web UI, changelog correlation, canary tenant logic, or AI summarisation.
That first PR worked well. AADLCv2 did exactly what it was supposed to do. It kept the work small, offline, deterministic, and testable.
Then came PR2, which intentionally added one new boundary: fetch unauthenticated public Microsoft Graph $metadata from fixed allowlisted endpoints. Not arbitrary URLs. Not tenant access. Not Graph SDK magic. Not authentication. Just these two endpoints, with strict validation, redirect rejection, content-type checks, sidecar metadata, offline tests, and no live network tests by default:
https://graph.microsoft.com/v1.0/$metadata
https://graph.microsoft.com/beta/$metadata
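To show what "fixed allowlisted endpoints" can mean in code, here is a minimal sketch, not the project's actual fetcher. The two URLs are the ones listed above. Everything else is an assumption made for illustration: the profile keys, the function names, and in particular the accepted media types, which would need checking against what the endpoint really returns.

```python
from urllib.parse import urlsplit

# The two endpoints named in the PR2 contract. The profile keys are assumed.
ALLOWED_ENDPOINTS = {
    "v1.0": "https://graph.microsoft.com/v1.0/$metadata",
    "beta": "https://graph.microsoft.com/beta/$metadata",
}
ALLOWED_HOST = "graph.microsoft.com"
# Assumption: XML media types are what a metadata document should arrive as.
ACCEPTED_MEDIA_TYPES = {"application/xml", "text/xml"}


def resolve_endpoint(profile):
    """Map a profile name to its fixed URL. Callers never supply a URL."""
    if profile not in ALLOWED_ENDPOINTS:
        raise ValueError(f"unknown profile: {profile!r}")
    return ALLOWED_ENDPOINTS[profile]


def validate_response(requested_url, final_url, status, content_type):
    """Raise ValueError unless the response came straight from the allowlisted URL."""
    parts = urlsplit(final_url)
    if parts.scheme != "https" or parts.hostname != ALLOWED_HOST:
        raise ValueError("unexpected scheme or host")
    if final_url != requested_url or 300 <= status < 400:
        raise ValueError("redirects are rejected")
    if status != 200:
        raise ValueError(f"unexpected status: {status}")
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ACCEPTED_MEDIA_TYPES:
        raise ValueError(f"unexpected content type: {content_type!r}")
```

The design choice worth noticing is that the public interface takes a profile name rather than a URL, so "arbitrary URLs" are excluded by construction instead of by filtering.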
This was a good test because it changed the contract. PR1 said “offline only”. PR2 said “we are intentionally adding exactly one constrained network boundary”. One model handled this well. Another model did not.
Codex 5.2 over-anchored on the PR1 contract. It treated “offline only” as if it were durable law rather than a completed PR constraint. It was not reckless; it was obedient in the wrong direction. That is a fascinating failure mode, and I have started calling it stale contract anchoring. The agent preserved constraints, but preserved the wrong constraint.
This matters because AADLC creates more explicit contracts. That is good, but once contracts exist, models need to understand their lifecycle. There is a difference between an active PR constraint, a completed PR constraint, a durable invariant, and an intentional amendment. If we do not distinguish them, an agent can become a tiny bureaucrat guarding yesterday’s scope against today’s approved work, which is funny until you are paying for the correction loop.
AADLCv2 now needs explicit contract lifecycle language: a completed PR contract is historical evidence, not automatically binding scope; only constraints explicitly promoted to durable invariants carry forward; a new PR may intentionally amend a previous constraint when the amendment is explicit, scoped, and recorded.
For example:
The previous PR contract described the completed offline-only foundation.
That completed constraint is historical evidence, not binding scope.
This PR intentionally amends the active contract by introducing exactly one new trust boundary:
unauthenticated HTTPS fetch from fixed Microsoft Graph metadata endpoints.
That wording tells the agent not to discard history, but also not to worship it. Good governance is not just adding rules. It is knowing which rules are still alive.
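The four lifecycle states named above (active PR constraint, completed PR constraint, durable invariant, intentional amendment) can also be modelled explicitly. The following is a hedged sketch rather than part of the AADLCv2 artefacts: the `Status` and `Constraint` types and the `binding` function are hypothetical, and the sample constraints, including which one counts as durable, are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"        # constraint of the PR currently in flight
    COMPLETED = "completed"  # historical evidence from a finished PR
    DURABLE = "durable"      # explicitly promoted to a durable invariant
    AMENDED = "amended"      # superseded by an explicit, recorded amendment


@dataclass(frozen=True)
class Constraint:
    text: str
    status: Status


def binding(constraints):
    """Return only the constraints that bind the current session."""
    return [
        c.text for c in constraints if c.status in (Status.ACTIVE, Status.DURABLE)
    ]


# Illustrative history: "offline only" was true for PR1 but does not bind PR2.
history = [
    Constraint("offline only", Status.COMPLETED),
    Constraint("no tenant access", Status.DURABLE),
    Constraint("fetch only from fixed allowlisted endpoints", Status.ACTIVE),
]
print(binding(history))
```

Stale contract anchoring is, in these terms, an agent treating a `COMPLETED` constraint as if it were `DURABLE`.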
Prompt-as-code: version the contract
Another field finding was more blunt: long UI prompts can break.
In PR2, a long nested prompt appeared to truncate or misparse in the GitHub agent UI around a deeply nested section. The model repeatedly reported missing content at the same place, which looked less like a reasoning failure and more like prompt transport failure. So the contract was moved into a committed PLAN.md, the agent was told to read the file, and the problem went away.
That is prompt-as-code.
It sounds obvious once you say it. Of course long, boundary-sensitive task instructions should be version-controlled. We already version code, infrastructure, policies, CI workflows and documentation, yet somehow we were still pasting giant operational contracts into chat boxes like medieval monks copying scrolls during an earthquake.
Prompt-as-code gives us exact preservation, diffs, review, line references, reproducibility, model comparison, less UI fragility, and fewer correction loops. For durable plans, I now prefer .github/aadlc/plans/prN-short-description.md. For temporary feature-branch work, a root PLAN.md can be fine as long as it is removed or archived before merge.
This led to one of my favourite tiny recursive moments. While updating the AADLC baseline itself, the coding agent committed a root PLAN.md. That looked wrong against the new rules, but it was actually operating under the old rules at the start of the session. The PR was creating the stricter rule that would later say root-level temporary plans should not be merged. That was not disobedience; it was a bootstrapping gap. The agent cannot obey tomorrow’s law until today’s PR merges.
Somewhere, a constitutional lawyer just felt a disturbance in the force.
The correction budget
Retries are where systems quietly die. That was true before AI, and it is extremely true with agents.
A bad agent session rarely fails all at once. It decays. First the model misunderstands something. You correct it. It accepts the correction, but in a way that bends another part of the task. You correct that. It now has the original task, the wrong interpretation, the correction, the correction to the correction, and a growing pile of conversational sediment. Then it tries to be helpful, because that is what it was built to do.
This is how you end up shepherding the model instead of using the model. That is not agentic development. That is emotional labour with a terminal.
AADLCv2 needs a correction budget. The current rule of thumb is:
One correction is acceptable.
Two corrections means reset.
Three means abandon the session or model.
This is not about being impatient. It is about recognising when the working context has become contaminated. If a model cannot understand the PR contract after one targeted correction, the cheapest path is often to stop, reset, and provide a cleaner contract to a fresh session or a different model. Continuing to patch the same failing frame is how you pay for context rot.
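The rule of thumb is simple enough to write down as a function, which at least removes the temptation to renegotiate it mid-session. This is a minimal sketch. The name `next_action` and the returned labels are hypothetical, and treating zero corrections as "continue" is an assumption the rule implies but does not state.

```python
def next_action(corrections):
    """Map the number of corrections issued so far to the budgeted response."""
    if corrections < 0:
        raise ValueError("corrections cannot be negative")
    if corrections <= 1:
        return "continue"  # one correction is acceptable
    if corrections == 2:
        return "reset"     # two corrections means reset the session
    return "abandon"       # three or more: abandon the session or model


for count in range(4):
    print(count, next_action(count))
```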
Model fallback without contract mutation
Another real-world finding: model availability is not stable. At one point, Codex 5.3 appeared in the UI but failed with an availability or entitlement-style error while other models worked in the same environment. The lesson is simple: a named model is not an invariant.
This matters because model fallback must not mutate the task. If the preferred implementation model is unavailable, fall back to another model or session, but keep the same goal, non-goals, approved scope, forbidden scope, invariants, trust boundaries, and acceptance criteria. The PR contract is the source of truth. The model is an execution resource. Do not rewrite the contract just because the model changed.
That may sound obvious, but it is exactly the kind of thing that quietly goes wrong under pressure. The model is unavailable, the task is annoying, the deadline exists, and suddenly the fallback path becomes “let us simplify the requirement until the weaker model can do something adjacent”. That is not fallback. That is scope erosion wearing a fake moustache.
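One cheap way to notice scope erosion during fallback is to fingerprint the contract before and after the model switch. The sketch below is an assumption about how that could be done, not an existing AADLCv2 check: `contract_fingerprint` and `fallback_preserves_contract` are hypothetical names, and the contract is treated as plain text.

```python
import hashlib


def contract_fingerprint(contract_text):
    """Hash the contract with line endings normalised so the digest is stable."""
    normalised = contract_text.replace("\r\n", "\n").strip() + "\n"
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def fallback_preserves_contract(original_contract, contract_given_to_fallback):
    """Return True when the fallback model received the same contract text."""
    return contract_fingerprint(original_contract) == contract_fingerprint(
        contract_given_to_fallback
    )


original = "Goal: add fetch command\nNon-goals: authentication\n"
print(fallback_preserves_contract(original, original.replace("\n", "\r\n")))
print(fallback_preserves_contract(original, "Goal: add something adjacent\n"))
```

A matching fingerprint does not prove the fallback model understood the contract. It only proves nobody quietly rewrote it on the way there.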
Tool-permission tiers: probability, not proof
AADLCv2 also needs a way to decide when tools are allowed. Not every tool action is equal. Reading a file is not the same as editing a file. Editing a file is not the same as changing a workflow. Changing a workflow is not the same as pushing secrets into a public repository while whistling innocently.
The tool tier model I have been using looks roughly like this:
| Tier | Role | Typical tasks | Access |
|---|---|---|---|
| Tier 0 | Shaper | clarify intent | no tool access |
| Tier 1 | Planner | produce scoped contract | read/search only |
| Tier 2 | Executor | bounded implementation | scoped edit and test only |
| Tier 3 | Deep Reviewer | architecture/security review | gated tools |
| Tier 4 | Human | final approval and merge | full authority |
This is deliberately similar in spirit to privileged access thinking. Do not run everything as root. Do not give every phase the same authority. Do not let the model escalate itself because it had a feeling. Deep reasoning and broad tooling should be escalations, not ambient defaults.
But again, honesty matters. When I put “read/search only” in an instruction, I am not proving that the platform has technically removed every possible write path from the model. I am making a rule explicit, making deviation more visible, and adding validation around the expected behaviour. The agent may still have access to a tool surface depending on how the platform presents it; what AADLCv2 does is make unauthorised use contrary to the contract, contrary to the policy, and hopefully caught by the review and validation steps.
That is why the instructions framework matters so much. The current reality is not “the agent can never do X”. The current reality is “the agent has been told not to do X, the contract says X is out of scope, the expected file set excludes X, the tests and review should catch X, and future sessions can learn from X if it happens”. That may sound less satisfying than a hard sandbox, but it is much closer to the truth, and truth is where useful engineering starts.
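In the same spirit, the tier table can be expressed as data and used as a review-time tripwire rather than as enforcement. This is a sketch under stated assumptions: the action labels (`read`, `search`, `edit`, `test`) are hypothetical, the assumption that an Executor may also read and search is mine, and Tiers 3 and 4 are deliberately left out because "gated tools" and human authority are decisions, not lookups.

```python
# Tier numbers and access levels follow the table above.
# The action labels themselves are hypothetical.
TIER_ALLOWED = {
    0: frozenset(),                                    # Shaper: no tool access
    1: frozenset({"read", "search"}),                  # Planner: read/search only
    2: frozenset({"read", "search", "edit", "test"}),  # Executor: scoped edit and test
}


def tier_violations(tier, observed_actions):
    """Return the observed actions that the tier's policy does not allow."""
    if tier not in TIER_ALLOWED:
        raise ValueError(f"tier {tier} needs a gated or human decision")
    return sorted(set(observed_actions) - TIER_ALLOWED[tier])


# Example: a planning session that edited a file has broken the policy.
print(tier_violations(1, ["read", "search", "edit"]))
```

Note what this does and does not do: it reports that a Planner session edited something, after the fact. It does not remove the edit tool from the platform.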
Deep reasoning must become exceptional, not ambient. Broad tool authority must become explicit, not assumed. If we cannot yet make every unsafe action impossible, we can at least make safe action the path of least resistance and unsafe action a visible contract violation.
That is not perfect. It is progress with guardrails, which is most of engineering if you scrape off the vendor pitch deck.
The field test: did AADLCv2 actually work?
In the m365-graph-schema-monitor field test, AADLCv2 produced two merged PRs. PR1 delivered the Python package skeleton, offline XML parser, schema diff engine, inspect and diff CLI commands, fixtures, tests, README, CI, and AADLC artefact updates. It avoided network access, authentication, tenant access, scheduler, database, web UI, AI summarisation, and platform expansion.
PR2 delivered a controlled fetcher, fixed profile-to-URL mapping, HTTPS and hostname enforcement, redirect rejection, timeout handling, content-type validation, overwrite guard, XML snapshot writing, exact JSON sidecar metadata, offline mocked tests, live network blocking, README updates, and AADLC artefact updates. It avoided arbitrary URLs, authentication, tenant access, Graph SDK dependency, runtime dependency creep, live network tests by default, and parser expansion beyond the PR scope.
That is a good result, not because the code is world-changing, but because the behaviour was controlled. The agents did not build a platform. They did not absorb scope creep. They did not quietly add a scheduler, database, or tenant auth because “it might be useful later”. They delivered small, reviewable, bounded slices.
That is exactly what I wanted AADLCv2 to produce.
What is actually proven?
This is where we need to be honest. I do not yet have enough data to prove that AADLCv2 reduces real AI Credit spend under the new billing model. That measurement starts properly when the billing change is live and comparable usage can be observed, so I cannot honestly claim that AADLCv2 reduces your bill by some neat percentage. Not yet. That would be marketing, and marketing is just hallucination with a budget.
What I can claim is stronger in a different way:
AADLCv2 makes delegated cognition governable.
That is now field-tested. The evidence supports that agents behave better when durable memory exists, explicit PR contracts reduce scope ambiguity, prompt-as-code solves real prompt transport issues, stale contract anchoring is a real failure mode, correction budgets prevent session rot, model fallback needs contract preservation, cognitive caching reduces repository archaeology, and deterministic validation keeps fluent confidence honest.
The cost graph is not proven yet. The containment field is.
The next phase is measurement. If AADLCv2 is FinOps for delegated cognition, then it needs a ledger. Useful metrics include AIC per merged PR, AIC per accepted task, AIC per failed task, AIC per model, AIC per repo, first-session AIC, resumed-session AIC, files read before first plan, repeated archaeology prompts, correction prompts per task, session resets, contract amendments, scope violations caught before implementation, stale memory detections, repeated rediscovery events, clean checkout validation, network guard failures, and post-merge fixes.
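A ledger for a couple of these metrics might start as small as the sketch below. It takes AIC to mean AI Credits consumed per session, which is an assumption about the unit, and the `SessionRecord` fields and function names are hypothetical. The sample rows are made-up placeholder values used only to exercise the arithmetic; they are not measurements from any real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRecord:
    pr: str           # task or PR identifier
    aic: float        # credits consumed by this session (assumed unit)
    corrections: int  # corrective prompts issued in this session
    merged: bool      # whether the PR this session belongs to was merged


def aic_per_merged_pr(records):
    """Total credits spent on merged PRs divided by the number of merged PRs."""
    merged_prs = {r.pr for r in records if r.merged}
    if not merged_prs:
        return None
    return sum(r.aic for r in records if r.pr in merged_prs) / len(merged_prs)


def corrections_per_task(records):
    """Total corrective prompts divided by the number of distinct tasks."""
    tasks = {r.pr for r in records}
    if not tasks:
        return None
    return sum(r.corrections for r in records) / len(tasks)


# Placeholder rows, NOT real data: two merged PRs (one needing a reset session)
# and one abandoned task.
PLACEHOLDER_LEDGER = [
    SessionRecord("pr-a", 10.0, 0, True),
    SessionRecord("pr-b", 6.0, 2, True),
    SessionRecord("pr-b", 4.0, 0, True),
    SessionRecord("pr-c", 8.0, 3, False),
]
print(aic_per_merged_pr(PLACEHOLDER_LEDGER))
```

Counting the reset session against its PR is a deliberate choice here: the cost of a polluted session is still part of what that PR cost.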
This is the difference between a methodology and a dashboard. AADLCv2 is not complete until it can observe itself, because of course it isn’t. The recursion goblin demands tribute.
How I am using it now
The practical workflow now has two paths. For small, obvious tasks, I shape quickly, let the agent inspect relevant files, make the change, run targeted validation, and update memory only if a durable truth changed. The point is not to drown tiny work in ceremony. Not every button needs a change advisory board and a violin solo.
For substantial or boundary-sensitive tasks, the process is more deliberate. I shape the task conversationally, generate a plan or spec, commit it as a plan artefact, ask the agent for plan-only output first, allow at most one corrective planning prompt, approve implementation only when the plan matches the contract, review the diff against that contract, validate deterministically, remove or archive temporary plan files, and update durable memory only with stable truth.
This is not slower in practice. It feels slower in the same way writing a good change request feels slower than SSHing into production and “just fixing it”. The latter feels fast right up until the retrospective. The former is boring, but it compounds, and compounding boring is most of engineering.
AADLCv2 as constrained probabilism
The deeper theoretical frame here is what I have been calling Constrained Probabilism. Large language models are probabilistic systems operating over vast possibility spaces. They are powerful because they can infer, connect, generalise, and generate. They are dangerous, expensive, or merely annoying when the possibility space is too wide and the constraints are too weak.
AADLCv2 is an operational implementation of constrained probabilism. It reduces uncertainty surfaces, externalises invariants, bounds execution, preserves durable context, narrows tool authority, forces validation, and prevents yesterday’s temporary scope from becoming tomorrow’s accidental law.
This is why AADLCv2 feels so familiar to security engineers. We already know how this story goes. Unbounded authority is dangerous. Implicit trust is dangerous. Untracked state is dangerous. Ambiguous boundaries are dangerous. A system that can act needs controls. A system that can spend needs FinOps. A system that can infer needs constrained probabilism.
Identity is infrastructure. Apparently cognition is too now.
The uncomfortable bit
The uncomfortable bit is that the industry is still mostly talking about AI coding in terms of productivity. How much faster can it write code? How many developers can it augment? How many tickets can it close? How much boilerplate can it vaporise?
Those questions are not wrong. They are just incomplete.
A better set of questions might be: how much hidden work did the agent perform, how much of that work was useful, how much was rediscovery, how much was retry, how much was caused by ambiguous framing, how much came from stale context, how often did the agent cross a trust boundary, how often did it need human correction, how often did we reset the session, how often did it change files outside intended scope, and how much did semantic maturity reduce cost over time?
That is the agentic FinOps conversation. Not “are agents useful?” They are. The question is whether we are governing them like engineering systems or treating them like clever interns with a corporate card.
So, is AADLCv2 real?
Yes. Not in the sense that it is finished, and not in the sense that I have a statistically significant dataset across fifty teams, seven languages, three quarters of billing telemetry, and a ceremonial spreadsheet blessed by Finance. But in the more important early sense: it explains observed behaviour and produces better outcomes when applied.
That is how practical engineering frameworks begin. Something hurts. You notice a pattern. You build a control. You test it. It fails in interesting ways. You update the control. You repeat until the system becomes less mysterious.
AADLCv2 has already done that. It explained why agentic work felt expensive and unstable. It produced a better repo baseline. It improved agent behaviour. It caught stale contract anchoring. It gave prompt-as-code a clear role. It made model fallback safer. It turned retries into a stop signal instead of a lifestyle.
That is enough to keep going. The next step is to measure the economics properly.
Final thought
The original AADLC post was about sculpting with agents. This follow-up is about building a workshop where the chisels have labels, the expensive tools require permission, the plans are pinned to the wall, the measurements are written down, and nobody has to rediscover what marble is every morning.
That is less romantic. It is also how real work gets done.
AADLCv2 does not make agents perfect. It makes their failure modes diagnosable, bounded, and recoverable. The point is not to make agents less capable. The point is to stop making them rediscover gravity.
And if, as the usage-based billing era arrives, that also stops them from turning every innocent little prompt into a surprise invoice with opinions?
Even better.