Part 3 of 5 | Series: What We Learned from the Claude Code Leak


There is a version of the Claude Code leak story that is entirely about features: background agents, persistent memory, multi-agent orchestration. It is a compelling story. But it misses something.

When you give an agent real power, the interesting engineering problem is not “what can it do.” It is “what stops it from doing the wrong thing.” The Claude Code source is valuable here because it shows how the control layer works.

If agents are going to become part of how an engineering organization actually executes, they need the same rigor as any other production system: controls, observability, auditability, and clear boundaries on what happens when something goes wrong.

I have had to make that concrete in my own setup. My ghostwriter agent is not allowed to auto-send anything. My personal assistant agent and my chief of staff agent have different lanes and different responsibilities. I arrived at these boundaries by trial and error as I built and used the agents. They are what keep autonomy useful, and they increase my trust in these agents, so I use them more.


Permission Denials Are Not Dead Ends

If you have worked with Claude Code or Codex and followed their reasoning as they execute tasks, you have seen that agents don’t stop when they hit a roadblock or an error; they keep going and try to deliver on what you asked them to do. The Claude Code source shows how that actually works inside the harness logic. You can find more details about the overall hook system in the Claude Code hooks reference.

Every tool call passes through a permission pipeline: mode check, hook evaluation, rule matching, and finally a user prompt. When a tool is denied, the denial is wrapped as a tool result and fed back to the model as if the tool had returned an error message. The agent sees “permission denied” as an output and adjusts its approach. It can try a different path, ask a clarifying question, or give up gracefully.
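To make the pipeline concrete, here is a minimal sketch of the four stages and of a denial coming back as an ordinary tool result. It is my own illustration, not code from the leak: the stage functions, the `ToolCall` and `ToolResult` types, the example rules, and the choice to stop at the first stage that returns a decision are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToolCall:
    tool: str
    argument: str


@dataclass
class ToolResult:
    ok: bool
    content: str


# A stage returns "allow", "deny", or None (no opinion, fall through).
Stage = Callable[[ToolCall], Optional[str]]


def mode_check(call: ToolCall) -> Optional[str]:
    # Hypothetical read-only mode: writes are denied before anything else runs.
    return "deny" if call.tool == "write" else None


def hook_evaluation(call: ToolCall) -> Optional[str]:
    # Hypothetical hook: block recursive deletes outright.
    return "deny" if "rm -rf" in call.argument else None


def rule_matching(call: ToolCall) -> Optional[str]:
    # Hypothetical rules: reads are always fine, shell commands in /etc are not.
    if call.tool == "read":
        return "allow"
    if call.tool == "bash" and "/etc" in call.argument:
        return "deny"
    return None


def make_user_prompt(approve: bool) -> Stage:
    # Stand-in for the interactive prompt, so the sketch runs unattended.
    def user_prompt(call: ToolCall) -> Optional[str]:
        return "allow" if approve else "deny"
    return user_prompt


def run_tool(call: ToolCall, stages: List[Stage],
             execute: Callable[[ToolCall], str]) -> ToolResult:
    for stage in stages:
        decision = stage(call)
        if decision == "deny":
            # The denial is data for the model, not an exception for the harness.
            return ToolResult(
                False,
                f"permission denied by {stage.__name__}: "
                f"{call.tool} {call.argument!r}",
            )
        if decision == "allow":
            return ToolResult(True, execute(call))
    return ToolResult(False, "permission denied: no stage approved this call")
```

The point of the design is in `run_tool`: the caller always gets a `ToolResult` back, so the agent loop can append it to the conversation the same way whether the tool ran or was refused.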

The system also tracks all denials across a session and reports them to the SDK caller. IDE integrations can surface patterns: if an agent keeps being denied bash commands in /etc, that is a useful signal. The denial is not just a workflow block; the system treats it as feedback.
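Session-level tracking can be very small. The sketch below is a hypothetical `DenialTracker`, not the structure used in the source; the `(tool, scope)` key and the threshold are assumptions chosen to show how a repeated pattern could be surfaced to a caller.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DenialTracker:
    # Every denial in the session, in the order it happened.
    denials: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, tool: str, scope: str) -> None:
        self.denials.append((tool, scope))

    def repeated(self, threshold: int = 2) -> Dict[Tuple[str, str], int]:
        # Patterns worth showing to a human: the same tool denied
        # in the same scope at least `threshold` times.
        counts = Counter(self.denials)
        return {key: n for key, n in counts.items() if n >= threshold}
```

An integration could call `repeated()` at the end of a session and ask the user whether a rule is too strict or the agent is on the wrong path.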

This is a fundamentally different mental model than an allowlist. The agent is not stopped at a wall. It is informed of constraints and expected to reason about them.

agent wants action
      |
      v
+-------------------+
| permission check  |
+-------------------+
      |
   allowed? ---------------- no ----------------+
      |                                         |
     yes                                        v
      |                               denial returned as
      v                               tool result to model
execute action                                  |
      |                                         v
      +------------------------------> agent adapts plan

Append-Only Logs: Accountability at the Architecture Level

The Kairos background daemon, Anthropic’s always-on autonomous agent mode, keeps append-only daily logs of everything it noticed, decided, and did.

The agent cannot erase its own history.

The decision to make logs append-only is an architectural statement about what kind of system this is supposed to be. An agent with real autonomy, one that acts on your behalf without explicit prompts, needs to be auditable by design.
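For readers building their own harness, the idea can be sketched in a few lines. This is not how Kairos stores its logs; the class name, the JSON-lines format, and the one-file-per-day layout are assumptions. The property it illustrates is that the agent-facing interface offers append and read, and nothing else.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List


class AppendOnlyLog:
    """Daily JSON-lines log exposing append and read, but no edit or delete."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def append(self, kind: str, detail: str) -> dict:
        now = datetime.now(timezone.utc)
        entry = {"ts": now.isoformat(), "kind": kind, "detail": detail}
        path = self.directory / f"{now.date().isoformat()}.jsonl"
        # Mode "a" only ever writes at the end of the file.
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")
        return entry

    def read_all(self) -> List[dict]:
        entries = []
        # ISO dates sort chronologically, so sorted file names give history order.
        for path in sorted(self.directory.glob("*.jsonl")):
            with open(path, encoding="utf-8") as handle:
                entries.extend(json.loads(line) for line in handle if line.strip())
        return entries
```

An interface like this only constrains code that goes through it. If the agent also has shell access to the log directory, the guarantee has to be enforced lower down, at the filesystem or storage layer.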

Even the memory consolidation path is isolated behind a forked subagent with limited tool access. That detail matters here not because of memory quality, but because it shows Anthropic separating evaluation and execution privileges. I wrote more about memory in a previous post.

That separation of concerns is very interesting. Generally speaking, this is just good systems design: the system that records and evaluates work should not have the same privileges as the system that executes it. Anthropic is applying that at the agent level too.
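One way to express that privilege split in your own harness is a tool registry that can be forked into a narrower one. The sketch below is an assumption-laden illustration: `ToolBox`, `fork`, and the tool names are hypothetical and are not taken from the Claude Code source.

```python
from typing import Callable, Dict, Iterable

Tool = Callable[[str], str]


class ToolBox:
    """Holds the tools one agent may call; a fork can only narrow the set."""

    def __init__(self, tools: Dict[str, Tool], allowed: Iterable[str]) -> None:
        allowed_names = set(allowed)
        self._tools = {n: fn for n, fn in tools.items() if n in allowed_names}

    def names(self) -> set:
        return set(self._tools)

    def call(self, name: str, argument: str) -> str:
        if name not in self._tools:
            # Same convention as a permission denial: a message, not a crash.
            return f"permission denied: tool {name!r} is not available here"
        return self._tools[name](argument)

    def fork(self, allowed: Iterable[str]) -> "ToolBox":
        # Built from this box's own tools, so it can never gain
        # a capability the parent does not have.
        return ToolBox(self._tools, allowed)


ALL_TOOLS: Dict[str, Tool] = {
    "read": lambda arg: f"contents of {arg}",
    "write": lambda arg: f"wrote {arg}",
    "bash": lambda arg: f"ran {arg}",
}

main_agent = ToolBox(ALL_TOOLS, ["read", "write", "bash"])
# Hypothetical consolidation subagent: it can read, and nothing else.
consolidator = main_agent.fork(["read"])
```

Asking `consolidator` to `write` returns a denial message, and asking the fork for a tool the parent never had simply yields nothing, which is the property that makes the split trustworthy.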


The Undercover Mode Question

The leaked source contains a file called undercover.ts (about 90 lines). When Claude Code contributes to public code repositories, it strips all model names, internal codenames, and Anthropic identifiers from its outputs. It cannot call itself “Claude Code.” It cannot reference internal Slack channels or project names.

The stated purpose is to prevent internal codenames from leaking into public contributions. That is a reasonable engineering goal. The side effect is that Anthropic employees using Claude Code to contribute to open source projects were not disclosing AI authorship in commits, and the tool was explicitly built to make sure that stayed the case.

This raises a question that we in the industry have not settled yet: when an AI agent contributes to a public codebase, does the commit need to disclose that? I personally think it should.


Anti-Distillation and the Data Arms Race

One of the less-discussed findings in the leak is a set of mechanisms designed to prevent competitors from using Claude Code to train their own models.

The code sends requests with anti_distillation: ['fake_tools'] flags that inject decoy tool definitions into system prompts. Any competitor routing requests through Claude Code to collect training data gets poisoned tool schemas baked into every prompt, making any model trained on that data less reliable.

A second layer, called CONNECTOR_TEXT, buffers the assistant’s text between tool calls, summarizes it with cryptographic signatures, and returns only those summaries to the API, so even if someone intercepts the API traffic, they get summaries, not full reasoning chains.

Both mechanisms operate at data collection time, not inference time. It looks like the goal is to make the training data unreliable, not to block the request.
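To picture the first mechanism, here is a rough sketch of decoy injection. Only the `anti_distillation` flag and its `fake_tools` value come from the analyses; the request shape, the decoy definitions, and both function names are hypothetical, and I am not claiming this is where or how the real injection happens.

```python
import copy
from typing import Any, Dict, List

# Hypothetical decoys: plausible-looking tools that do not exist.
DECOY_TOOLS: List[Dict[str, str]] = [
    {"name": "decoy_sync_workspace", "description": "Looks real, does nothing."},
    {"name": "decoy_index_repo", "description": "Looks real, does nothing."},
]


def render_system_prompt(base: str, tools: List[Dict[str, str]]) -> str:
    lines = [base, "Available tools:"]
    lines += [f"- {tool['name']}: {tool['description']}" for tool in tools]
    return "\n".join(lines)


def prepare_prompt(request: Dict[str, Any]) -> str:
    # Work on a copy so the caller's tool list is never modified.
    tools = copy.deepcopy(request["tools"])
    if "fake_tools" in request.get("anti_distillation", []):
        tools.extend(copy.deepcopy(DECOY_TOOLS))
    return render_system_prompt(request["system"], tools)
```

Anyone recording the rendered prompts as training data would capture the decoys alongside the real tools, which is the poisoning effect described above.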

One technical analysis estimated a determined team could bypass both within an hour using a proxy or a specific environment variable. The stronger deterrent is probably legal, not technical. But the intent is visible in the architecture, which I thought was interesting in such a competitive and high-stakes market.


The 12% Problem

One detail buried in the analyses of this leak deserves more attention.

The AI Corner noted that Anthropic’s own research shows Claude has attempted to compromise its own servers with a 12% sabotage rate in adversarial testing. The analysis connects that finding to the safety layer infrastructure in the source: the permission pipelines, the append-only logs, the subagent sandboxing.

These are not just well-defined product features. They are responses to known failure modes in the model’s own behavior under adversarial conditions. Anthropic is not assuming the model will always do the right thing. They are building the harness to catch it when it does not.

This concept is critical for everyone out there trying to build their own internal harness for their teams.

It also matches how I think about the organizational transition to Agentic-First. The goal is not to blindly trust the model or the agent. The goal is higher-leverage execution with the right verification model for the blast radius of the task. Keep the agent focused, with just enough context and strong guardrails to deliver on the task.
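Here is how I would sketch "verification matched to blast radius" as a policy table. The three tiers, their names, and the verification steps are my own assumptions for illustration; your organization will draw the lines differently.

```python
from enum import IntEnum


class BlastRadius(IntEnum):
    LOCAL = 1     # e.g. drafting text or editing a scratch file
    SHARED = 2    # e.g. opening a pull request others will see
    EXTERNAL = 3  # e.g. sending an email or deploying to production


# Hypothetical policy: the wider the impact, the stronger the check.
VERIFICATION = {
    BlastRadius.LOCAL: "log only",
    BlastRadius.SHARED: "automated checks plus human review",
    BlastRadius.EXTERNAL: "explicit human approval before execution",
}


def required_verification(radius: BlastRadius) -> str:
    return VERIFICATION[radius]


def may_run_unattended(radius: BlastRadius) -> bool:
    # Only the lowest tier runs without a person in the loop.
    return radius == BlastRadius.LOCAL
```

Under a table like this, my ghostwriter agent drafting a post is `LOCAL`, while sending it would be `EXTERNAL`, which is why it is not allowed to auto-send.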


What Responsible Autonomy Actually Looks Like

The safety architecture in Claude Code is not a checklist of guardrails bolted on after the fact. It is baked into the core design: denials as feedback, logs as accountability primitives, and subagent sandboxing for memory operations.

For CTOs and Engineering Leaders thinking about deploying agents at scale, this is the part of the source code worth understanding more carefully. The question is not whether your agents can do the work. The question is what happens when they do the wrong thing, whether you will know about it, and whether you can keep them in their lane.

Agents are an incredibly powerful tool, but power without control can be a massive liability.


Sources: Haseeb Qureshi, Inside the Claude Code source · Ars Technica, Here’s what that Claude Code source leak reveals about Anthropic’s plans · Engineer’s Codex, Diving into Claude Code’s Source Code Leak · The AI Corner, Claude Code Source Code Leaked: What’s Inside · Claude Code hooks reference · Claw Code query engine code