The Wrong Latency Question
When people talk about latency in AI systems, they usually mean network latency.
How many hops?
Local or remote?
HTTP or stdio?
Milliseconds or seconds?
That’s a reasonable place to start - but it’s the least interesting part of the problem.
Recently, discussions with colleagues, external posts, and hands-on experiments have all converged on the same uncomfortable conclusion:
The latency that hurts most in MCP-based systems is not transport latency.
It’s semantic latency.
And semantic latency compounds in ways that don’t show up in demos, benchmarks, or happy-path diagrams.
A Simple Question That Wouldn’t Stay Simple
It started, as these things often do, with an innocent architectural question:
Is a local MCP server over stdio actually “better” than a remote MCP server over HTTPS?
On the surface, the answer seems obvious:
- In both cases, you eventually hit the network
- The upstream API dominates latency
- A local MCP doesn’t magically make SaaS faster
And yet… the discussion refused to collapse into “it doesn’t matter”.
Because what kept coming up wasn’t average latency - it was where uncertainty enters the system, and how early it starts to hurt you.
Distributed Boundaries Are Not Free
A useful way to frame the two common MCP deployment patterns is this:
Local MCP (stdio / IPC)
Client → local IPC → MCP → network → upstream API
Remote MCP (HTTPS)
Client → network → MCP → network → upstream API
Yes, both paths include a network hop.
But the second path introduces an additional distributed boundary earlier in the chain.
That boundary comes with:
- serialization and deserialization
- retries and timeouts
- jitter and variance
- authentication and trust assumptions
- failure modes that are harder to reason about
None of this is controversial in distributed systems. We’ve known for decades that:
Latency isn’t additive - it compounds.
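To make the compounding concrete, here is a toy Monte Carlo sketch of the two chains above. Every number in it is a made-up assumption, not a measurement - the point is only that an extra boundary with jitter, timeouts, and retries widens the tail, not just the mean.

```python
# Toy Monte Carlo comparison of the two chains above.
# All numbers are illustrative assumptions, not measurements.
import random

def network_hop(base_ms=80.0, jitter_ms=40.0, timeout_ms=500.0, fail_rate=0.02):
    """One network hop: base latency plus jitter, with an occasional timeout-then-retry."""
    latency = base_ms + random.expovariate(1.0 / jitter_ms)
    if random.random() < fail_rate:
        # Transient failure: the failed attempt costs the full timeout, then we retry the hop.
        latency = timeout_ms + network_hop(base_ms, jitter_ms, timeout_ms, fail_rate)
    return latency

def local_ipc_hop(base_ms=1.0):
    """Local stdio/IPC hop: cheap and low-variance."""
    return base_ms + random.uniform(0.0, 1.0)

def local_mcp():
    # Client -> local IPC -> MCP -> network -> upstream API
    return local_ipc_hop() + network_hop()

def remote_mcp():
    # Client -> network -> MCP -> network -> upstream API
    return network_hop() + network_hop()

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[int(p * (len(ordered) - 1))]

random.seed(42)
for name, chain in [("local MCP (stdio)", local_mcp), ("remote MCP (HTTPS)", remote_mcp)]:
    runs = [chain() for _ in range(50_000)]
    print(f"{name:>18}: p50 = {percentile(runs, 0.50):6.1f} ms, p99 = {percentile(runs, 0.99):6.1f} ms")
```

The medians barely differ. The p99s do - and p99 is what your users and your retry budget actually feel.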
And yet, even that framing turned out to be incomplete.
Because it still assumes that the dominant cost is the network.
It isn’t.
The Latency You Don’t See in Traces
One of the most interesting posts to surface recently was Rob Murphy’s “MCP Doesn’t Scale and Anthropic Just Admitted It”.
What’s notable about that post is what it doesn’t focus on.
It isn’t complaining about:
- HTTP overhead
- JSON-RPC framing
- socket performance
Instead, it highlights something far more damaging:
MCP introduces cognitive and token overhead that quietly destroys performance in production-grade agent systems.
In Rob’s testing, tasks that required ~1,000 tokens via a clean REST interface ballooned to ~20,000 tokens when mediated through MCP tool schemas and definitions.
That difference wasn’t about transport.
It was about attention.
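A back-of-envelope calculation shows how quickly that happens. The figures below are hypothetical placeholders, not Rob’s data - the point is the shape: schema overhead scales with the width of the tool surface, not with the task.

```python
# Back-of-envelope prompt arithmetic. Hypothetical figures, chosen only to show the shape.
task_tokens       = 1_000   # what the task needs through a lean REST-style interface
tools_exposed     = 40      # tools advertised to the model in one MCP session
avg_schema_tokens = 450     # JSON schema + description per tool

overhead = tools_exposed * avg_schema_tokens
total = task_tokens + overhead
print(f"task: {task_tokens:,} tokens, schema overhead: {overhead:,} tokens, "
      f"total ≈ {total:,} tokens ({total / task_tokens:.0f}x the task itself)")
```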
Semantic Latency: The Hidden Multiplier
Let’s name the thing properly.
Semantic latency is the delay introduced when a model must:
- reason about large, verbose tool schemas
- decide which tool to use
- interpret repeated definitions
- maintain relevance across long contexts
This latency shows up as:
- longer inference time
- higher variance
- increased hallucinations
- incorrect tool selection
- partial or unusable outputs
And crucially:
Semantic latency doesn’t just slow answers - it increases retry rates.
Retries are where systems quietly die.
Determinism Is an Architectural Expectation, Not a Model Property
If you’re using MCP at all, you are implicitly seeking deterministic outcomes.
What teams are really trying to achieve with MCP isn’t deterministic models, but something closer to constrained probabilism: probabilistic reasoning engines operating within deliberately narrow, deterministic system boundaries.
Unconstrained probabilism maximises model freedom; constrained probabilism maximises system predictability.
You don’t want:
- “kind of right”
- “creative interpretation”
- “plausible but wrong”
You want:
- the right tool
- with the right parameters
- every time
But LLMs are probabilistic by nature.
So what do we do?
We add:
- guardrails
- schema validation
- retries
- fallback prompts
- repair loops
- temperature tuning
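A minimal sketch of that validate-and-retry loop is below. call_model and the allowed tool names are hypothetical stand-ins for illustration, not any specific SDK.

```python
# Minimal sketch of a validate-then-retry guardrail loop.
# call_model is a hypothetical stand-in for whatever inference client you use.
import json

ALLOWED_TOOLS = {"get_invoice", "list_invoices"}  # invented names for illustration

def validate_tool_call(raw: str) -> dict:
    """Enforce a narrow, deterministic contract on the model's output."""
    call = json.loads(raw)  # raises ValueError on malformed JSON
    if call.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {call.get('tool')!r}")
    if not isinstance(call.get("args"), dict):
        raise ValueError("args must be a JSON object")
    return call

def constrained_call(prompt: str, call_model, max_attempts: int = 3) -> dict:
    """Every failed attempt costs another full inference pass - that is where latency hides."""
    error = None
    for attempt in range(max_attempts):
        repair_hint = "" if error is None else (
            f"\n\nYour previous output was invalid ({error}). "
            "Return only valid JSON for one allowed tool."
        )
        try:
            return validate_tool_call(call_model(prompt + repair_hint))
        except ValueError as err:
            error = err
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")
```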
Each retry adds:
- inference latency
- token cost
- tail latency variance
This is why two systems with identical mean latency can feel radically different in production.
One retries once in a hundred requests.
The other retries once in five.
Only one of them scales.
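The arithmetic is simple if we assume each attempt fails independently with probability p (a geometric retry process). The round-trip time below is a hypothetical figure.

```python
# Same single-attempt latency, different retry rates.
# Assumes each attempt fails independently with probability p.
single_attempt_ms = 800.0  # hypothetical mean time for one model round-trip

for name, p in [("retries 1 in 100", 0.01), ("retries 1 in 5", 0.20)]:
    expected_attempts = 1.0 / (1.0 - p)   # mean of a geometric distribution
    slow_requests = p ** 2                # share needing at least three attempts
    print(f"{name}: mean ≈ {single_attempt_ms * expected_attempts:.0f} ms, "
          f"{slow_requests:.2%} of requests need three or more attempts")
```

The means are within 25% of each other. The share of requests stuck in multi-attempt loops differs by two orders of magnitude.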
“It’s an Agent Design Problem” (And Why That’s Not the Whole Story)
A common response to MCP criticism is:
“MCP doesn’t force you to include all tools in context. That’s an agent design problem.”
This is technically correct.
It’s also incomplete.
Saying this is like saying:
“Microservices don’t add latency - bad service design does.”
True at the spec level.
Misleading at the system level.
Protocols shape behaviour.
MCP:
- normalises verbose tool schemas
- encourages discoverability over minimalism
- pushes reasoning into the model instead of the code
Yes, you can design around this.
But when the creators of MCP themselves quietly advise:
- generating code files instead of tool calls
- browsing directories instead of invoking schemas
- dropping MCP connections to reclaim token budget
That’s not user error.
That’s architectural tension.
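To be concrete about what “designing around it” looks like: keep routing deterministic in code, and let the model see only the minimal tool surface for the request at hand. The catalogue and intents below are invented for illustration - MCP does not prescribe this pattern.

```python
# A sketch of keeping tool selection in code instead of in the model's context.
# Catalogue entries and intents are hypothetical.

FULL_CATALOGUE = {  # dozens of entries in a real system; the model never sees them all
    "billing.get_invoice": "large JSON schema + long description ...",
    "billing.refund":      "large JSON schema + long description ...",
    "crm.lookup_customer": "large JSON schema + long description ...",
}

INTENT_TO_TOOLS = {  # deterministic, code-level routing
    "invoice_question": ["billing.get_invoice"],
    "refund_request":   ["billing.get_invoice", "billing.refund"],
}

def tools_for(intent: str) -> dict:
    """Select the narrow tool surface before the model ever sees a schema."""
    return {name: FULL_CATALOGUE[name] for name in INTENT_TO_TOOLS.get(intent, [])}

print(list(tools_for("invoice_question")))  # only one schema reaches the prompt
```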
Context Rot Is Not Forgetting - It’s Drowning
One of the most useful concepts to emerge in recent LLM research is context rot.
Context rot isn’t about memory limits. It’s about attention dilution.
As context grows:
- important signals lose salience
- irrelevant structure consumes attention heads
- recall accuracy drops even as information increases
- hallucinations increase - remember, LLMs are goal-oriented, and their primary goal is to be helpful, even if that means inventing “truth” rather than admitting defeat
MCP, when used naïvely, accelerates context rot:
- repeated schemas
- large static definitions
- wide tool surfaces
The model isn’t forgetting.
It’s drowning.
Enter Recursive Language Models (RLM)
This is where things get genuinely interesting.
Recent work on Recursive Language Models (RLMs) proposes a radically different approach:
Instead of stuffing everything into one giant prompt, the model:
- queries its own context
- revisits relevant slices
- re-invokes itself with narrowed focus
- externalises state instead of internalising it
The key insight is this:
Keep semantic state, not raw transcript.
RLMs don’t eliminate recursion. They discipline it.
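In code, the disciplined version looks roughly like this. It is a paraphrase of the idea, not the paper’s implementation - call_model is a hypothetical stand-in, and in a real RLM the relevance and summarisation steps would themselves be (smaller) model calls.

```python
# A rough paraphrase of the RLM idea, not a specific implementation.
# call_model is a hypothetical stand-in for an inference client.

def pick_relevant(question: str, chunks: list[str], keep: int = 4) -> list[str]:
    """Keep only the slices that plausibly matter (naive keyword overlap here)."""
    words = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:keep]

def fold_into_state(state: str, chunks: list[str], budget: int = 800) -> str:
    """Carry forward compact semantic state, not the raw transcript."""
    return (state + " " + " ".join(chunks)).strip()[:budget]

def recursive_answer(question: str, chunks: list[str], call_model,
                     state: str = "", depth: int = 0, max_depth: int = 3) -> str:
    if depth >= max_depth or len(chunks) <= 2:
        # Base case: the slice is narrow enough to answer from directly.
        context = "\n".join(chunks)
        return call_model(f"State: {state}\nContext: {context}\nQuestion: {question}")
    relevant = pick_relevant(question, chunks)   # revisit only the relevant slices
    state = fold_into_state(state, relevant)     # externalise what was learned so far
    return recursive_answer(question, relevant, call_model, state, depth + 1, max_depth)
```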
And the effect is dramatic:
- context rot collapses
- accuracy improves
- variance drops
- fewer retries are required
Which brings us back to latency.
Why RLM Improves Latency Without Being “Faster”
RLM doesn’t necessarily reduce single-shot inference time.
What it reduces is failed trajectories.
That matters far more.
RLM doesn’t just improve accuracy - it reduces how often the system has to apologise and try again.
Most production latency is not:
- “how fast did the model answer?”
It’s:
- “how many times did we have to ask again because the answer was unusable?”
RLM reduces:
- hallucinated tool calls
- mis-grounded reasoning
- partial outputs
- repair loops
Which means:
- fewer retries
- tighter tail latency
- more predictable systems
This is latency engineering by variance reduction, not raw speed.
MCP, Revisited (With Better Questions)
With all of this in mind, the right questions about MCP are no longer:
- Is stdio faster than HTTPS?
- Is JSON-RPC expensive?
- Does local beat remote?
The better questions are:
- Where does semantic uncertainty enter the system?
- How early does probabilistic drift start?
- How many retries does this architecture require under stress?
- What does this force us to do with model size and cost?
- Are we optimising for exploration or production?
When you ask those questions, a pattern emerges.
Where MCP Shines - And Where It Hurts
MCP works best when used as a contract boundary for constrained probabilism, not as a universal reasoning substrate.
MCP is genuinely excellent for:
- exploratory coding agents
- schema discovery
- complex tool surfaces
- development-time workflows
- frontier models with large attention budgets
It struggles with:
- latency-sensitive production agents
- high-volume automation
- small or efficient models
- deterministic operational paths
That’s not a failure.
It’s a fit problem.
The Real Lesson: Choose Where the Pain Enters
You cannot eliminate:
- latency
- trust
- uncertainty
- probabilistic behaviour
But you can choose:
- where they enter the system
- how early they appear
- how visible they are
- how much they compound
A local MCP over stdio doesn’t remove network latency. But it delays distributed uncertainty.
A clean REST tool doesn’t remove model error. But it narrows the reasoning surface.
An RLM doesn’t make models deterministic. But it localises randomness.
Each of these choices reduces variance.
And variance is what kills systems at scale.
Capability Is Not Obligation (Again)
MCP is powerful. RLM is powerful. Large context windows are powerful.
That does not mean they belong everywhere.
Just because a system can reason about everything doesn’t mean it should.
The most reliable systems are not the cleverest ones. They are the ones that are boring, constrained, and predictable in the places that matter.
Use MCP deliberately. Use RLM strategically (once it is readily available). Use complexity where it buys you something.
And above all:
Design for the agents you actually need - not the ones that demo well.