how-i-operate claude-code mcp workflow Mar 30, 2026 · 8 min read

Why your AI agent is too enthusiastic to follow a process.

An agent ran a flawless prospecting campaign that tested nothing it was supposed to test. The architectural lesson: agents should own content, not process.

There's a failure mode with AI agents that nobody warns you about. It's not hallucination. It's not refusal. It's enthusiasm.

Your agent wants to help. It sees the goal, it understands the domain, and it goes. It improvises. It picks interesting inputs. It makes judgment calls about sequencing. It skips steps it considers unnecessary. And at the end, it presents you with something that looks like progress but missed the entire point.

I watched this happen last week, in a session where the goal was explicitly stated, the steps were defined, and the agent still went off-script within sixty seconds.

What We Asked For

The task was straightforward: run a controlled, deterministic end-to-end test to verify that our benchmark instrumentation captures data.

Context: we'd just built instrumentSkill() wrappers around three MCP tool handlers in AgentCRM. Each wrapper writes timing data and governance check results to a skill_benchmarks table, fire-and-forget, so a failed write never blocks the skill itself. The whole point was observability: can we measure how long each skill takes, and does the output pass our quality gates?
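
For readers who haven't seen this kind of wrapper, here is a minimal sketch of the idea, not the actual AgentCRM code. The row shape, the `governs` and `sink` parameters, and the `draft_email` skill are all assumptions made up for illustration, and the handler is synchronous for brevity where a real MCP handler would be async.

```typescript
// Hypothetical row shape: the real skill_benchmarks schema is not shown in this post.
interface BenchmarkRow {
  skill: string;
  latencyMs: number;
  governancePassed: boolean;
}

type Handler<I, O> = (input: I) => O;

// Wraps a handler so every call through it records timing and a governance result.
// `governs` and `sink` are illustrative parameters, not the real AgentCRM signature.
function instrumentSkill<I, O>(
  skill: string,
  handler: Handler<I, O>,
  governs: (output: O) => boolean,
  sink: (row: BenchmarkRow) => void,
): Handler<I, O> {
  return (input: I): O => {
    const started = Date.now();
    const output = handler(input); // a throwing handler records nothing in this sketch
    const latencyMs = Date.now() - started;
    try {
      // Fire-and-forget: a failed check or write must never fail the skill.
      sink({ skill, latencyMs, governancePassed: governs(output) });
    } catch {
      // Swallowed on purpose; instrumentation is best-effort.
    }
    return output;
  };
}

// Usage: an in-memory array stands in for the skill_benchmarks table.
const benchmarkRows: BenchmarkRow[] = [];
const draftEmail = (contact: string): string => `Hi ${contact}, quick question.`;
const noEmDash = (text: string): boolean => text.indexOf("\u2014") === -1;

const instrumentedDraftEmail = instrumentSkill(
  "draft_email",
  draftEmail,
  noEmDash,
  (row) => { benchmarkRows.push(row); },
);
const drafted = instrumentedDraftEmail("Ada");
```

The property that matters for the rest of this story: a row only appears when the call goes through the wrapped function. Calling `draftEmail` directly returns the same text and records nothing.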

The test plan was simple:

  1. Call each instrumented skill through its MCP handler (not directly)
  2. Use known fixture data as input (not live web scraping)
  3. After each call, query skill_benchmarks to verify a row was written
  4. Report latencies and governance pass rates

Four steps. Deterministic inputs. Verifiable outputs. The kind of test a junior QA engineer could write in an afternoon.

What Actually Happened

The agent picked Velir — a real digital agency — as the test subject. Not a fixture. A live company.

It scraped Velir's actual website using AdobeScanner. It called Apollo for real contacts. It ran the full email generation pipeline with non-deterministic LLM calls. It produced six personalized emails with persona-specific angles.

And it called every skill directly — bypassing the MCP handlers entirely. The MCP handlers are where instrumentSkill() lives. That's where the timing data gets written. That's the thing we were testing.

The agent ran the whole pipeline and never wrote a single row to skill_benchmarks.

Here's the gap between what was requested and what was delivered:

Requirement                     | What Happened
--------------------------------|--------------------------------------------------
Use fixture data                | Picked a live agency, scraped real websites
Call through MCP handlers       | Called skills directly, bypassing instrumentation
Verify rows in skill_benchmarks | Never queried the table
Deterministic, repeatable       | Non-deterministic LLM calls, live API hits
Test the instrumentation        | Tested the pipeline output instead
Four defined steps              | Improvised an eight-step prospecting run

The emails were good. The persona mapping was solid. The governance checks found real issues (em-dashes in all six emails — a rule we hadn't added yet). As a prospecting run, it was arguably successful.

But we weren't prospecting. We were testing instrumentation. And on that measure, it was a complete miss.

Why the Agent Did This

Here's the thing: the agent wasn't malfunctioning. It was doing what agents do. It saw "run the pipeline" and it ran the pipeline. It made the output as good as it could. It picked a real company because real data produces more interesting results. It called skills directly because that's the fastest path to output.

The agent optimized for the most impressive result, not the most correct process.

This is the failure mode. An agent has two kinds of autonomy:

Content Autonomy

Deciding what to write, how to analyze, which angle to use. This is where agents excel. Creative, fast, surprisingly good at judgment calls within a domain.

Process Autonomy

Deciding which steps to run, in what order, with what inputs. This is where agents consistently fail — they treat process as suggestion, not contract.

When you give an agent both kinds of autonomy, it will sacrifice process fidelity for content quality every time. It's like hiring a brilliant salesperson who ignores the CRM and closes deals on napkins. The deals are real. The data is missing. And you can't measure anything.

Agents are brilliant at discovery, analysis, creative generation — anywhere the task is "figure it out." They fail at "do these 10 things in this exact order and don't skip any." That's what code is for.

The Architecture Problem

Right now, most agent pipelines look like this:

Agent receives instruction
  → Agent decides what to do
    → Agent calls tools ad-hoc
      → Agent decides when it's done

The agent owns sequencing, input selection, tool routing, and completion criteria. It's an executor and an orchestrator. And it's a bad orchestrator — not because it's stupid, but because it's optimizing for the wrong thing.

Our outreach pipeline has seven steps that must execute in order:

  1. Scan the agency website (AdobeScanner)
  2. Classify the prospect against our ICP
  3. Find contacts at the agency (Apollo)
  4. Map personas to the buying committee
  5. Generate emails per persona, using scan data as hooks
  6. Create Gmail drafts with tracking signatures
  7. Log delivery and update the CRM

Each step produces a specific output that feeds the next step. Step 5 can't run without Step 1's scan data. Step 6 can't run without Step 5's email content. Step 7 can't run without Step 6's draft IDs.

An agent "wanting to run it" means it might skip a step, reorder steps, or improvise the inputs along the way.

A workflow enforces: Step 1 must complete and produce output X before Step 2 starts. Step 2's output feeds Step 3. No skipping. No reordering. No improvisation on sequencing.
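
To make "no skipping, no reordering" concrete, here is a rough sketch of an order check over the seven steps above. The step names and the `needs` field are my own shorthand, not a real workflow format, and I've only encoded the three dependencies stated explicitly above (5 needs 1, 6 needs 5, 7 needs 6); the rest are left empty rather than guessed.

```typescript
// Hypothetical step description: `needs` lists the steps whose output this step consumes.
interface StepSpec {
  id: string;
  needs: string[];
}

const outreachPipeline: StepSpec[] = [
  { id: "scan", needs: [] },
  { id: "classify", needs: [] },
  { id: "find_contacts", needs: [] },
  { id: "map_personas", needs: [] },
  { id: "generate_emails", needs: ["scan"] },
  { id: "create_drafts", needs: ["generate_emails"] },
  { id: "log_delivery", needs: ["create_drafts"] },
];

// Walks the steps in the given order and returns one message for every
// dependency that has not run by the time the step that needs it comes up.
function orderViolations(steps: StepSpec[]): string[] {
  const done: string[] = [];
  const violations: string[] = [];
  for (const step of steps) {
    for (const dep of step.needs) {
      if (done.indexOf(dep) === -1) {
        violations.push(`${step.id} needs ${dep}, which has not run yet`);
      }
    }
    done.push(step.id);
  }
  return violations;
}

// An "improvised" run that decides the scan is unnecessary.
const improvised = outreachPipeline.filter((step) => step.id !== "scan");
const problems = orderViolations(improvised);
```

The defined pipeline produces no violations. The improvised one is flagged at `generate_emails` before anything runs, which is the point: the check lives outside the agent.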

The Fix: Agents for Content, Workflows for Process

The answer isn't "make agents dumber" or "add more guardrails." The answer is to stop asking agents to do two jobs.

Workflows own the process. A workflow is a state machine. It knows the steps, the order, the required inputs and outputs for each step, and the completion criteria. It doesn't improvise. It doesn't skip steps because they seem unnecessary. It runs Step 1, validates the output schema, passes it to Step 2, and continues.

Agents own the content. Within each step, the agent has full autonomy. "Here's the company scan data. Write a cold email for a CFO persona using the Reply Method. Follow these twelve rules." The agent can be as creative as it wants — that's where it excels.
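
Here's roughly what that split looks like in code. This is a sketch under my own assumptions, not the dispatch system itself: the `Step` shape, `runWorkflow`, and the stand-in steps are all invented for illustration, and the "agent" is a plain function.

```typescript
// Each step pairs a content producer (where an agent or tool would be called)
// with a validator that the workflow, not the agent, runs before moving on.
interface Step {
  id: string;
  run: (input: unknown) => unknown;
  validate: (output: unknown) => boolean;
}

interface RunResult {
  completed: string[];
  failedAt: string | null;
  output: unknown;
}

// Runs steps strictly in order and stops at the first output that fails validation.
function runWorkflow(steps: Step[], initial: unknown): RunResult {
  let current: unknown = initial;
  const completed: string[] = [];
  for (const step of steps) {
    const output = step.run(current);
    if (!step.validate(output)) {
      return { completed, failedAt: step.id, output };
    }
    completed.push(step.id);
    current = output;
  }
  return { completed, failedAt: null, output: current };
}

// Stand-ins for real skills.
const isNonEmptyString = (value: unknown): boolean =>
  typeof value === "string" && value.length > 0;

const scanStep: Step = {
  id: "scan",
  run: () => "3 findings",
  validate: isNonEmptyString,
};
const emailStep: Step = {
  id: "write_email",
  run: (scan) => `Noticed ${String(scan)} on your site.`,
  validate: isNonEmptyString,
};
const emptyScanStep: Step = { id: "scan", run: () => "", validate: isNonEmptyString };

const happyPath = runWorkflow([scanStep, emailStep], "fixture-company");
const brokenScan = runWorkflow([emptyScanStep, emailStep], "fixture-company");
```

In the happy path both steps complete and the email step receives the scan output. With an empty scan, the run stops at `scan` and the email step never executes. The step body is free to be as creative as it likes; it just can't decide what happens next.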

The controlled test that failed? Under a workflow architecture, it would have looked like this:

Workflow: benchmark_verification_test
  Step 1: Load fixture data from test/fixtures/velir.json
  Step 2: Call mcp_handler("adobe_scanner", fixture_input)  ← instrumented
  Step 3: Query skill_benchmarks WHERE skill = "adobe_scanner"
  Step 4: Assert row exists, extract latency
  Step 5: Call mcp_handler("apollo_prospector", fixture_input)  ← instrumented
  Step 6: Query skill_benchmarks WHERE skill = "apollo_prospector"
  Step 7: Assert row exists, extract latency
  Step 8: Report results

No agent decides whether to use fixtures or live data. No agent chooses whether to call through MCP or directly. The workflow dictates the path. The agent does the thinking within each step.
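
As runnable code, the same test might look something like the sketch below. Everything here is a stand-in: the table is an in-memory array, `mcpHandler` is a toy version of the instrumented path, and the fixture is a string rather than the JSON file named above.

```typescript
interface BenchmarkRow {
  skill: string;
  latencyMs: number;
}

// In-memory stand-in for the skill_benchmarks table.
const table: BenchmarkRow[] = [];

// Raw skills: calling these directly bypasses instrumentation.
const adobeScanner = (input: string): string => `scanned ${input}`;
const apolloProspector = (input: string): string => `contacts for ${input}`;

// Toy stand-in for the MCP layer: this is the only path that writes a row.
function mcpHandler(skill: string, input: string): string {
  const started = Date.now();
  let output: string;
  if (skill === "adobe_scanner") output = adobeScanner(input);
  else if (skill === "apollo_prospector") output = apolloProspector(input);
  else throw new Error(`unknown skill: ${skill}`);
  table.push({ skill, latencyMs: Date.now() - started });
  return output;
}

interface Verification {
  skill: string;
  rowWritten: boolean;
  latencyMs: number | null;
}

// For each skill: call through the handler with the fixture, then check for a new row.
function verifyBenchmarks(skillNames: string[], fixture: string): Verification[] {
  return skillNames.map((skill) => {
    const before = table.filter((row) => row.skill === skill).length;
    mcpHandler(skill, fixture);
    const rows = table.filter((row) => row.skill === skill);
    const last = rows[rows.length - 1];
    const rowWritten = rows.length === before + 1;
    return { skill, rowWritten, latencyMs: rowWritten && last ? last.latencyMs : null };
  });
}

const report = verifyBenchmarks(["adobe_scanner", "apollo_prospector"], "velir-fixture");

// What the agent actually did: a direct call, which leaves the table untouched.
const rowsBeforeDirectCall = table.length;
adobeScanner("velir-fixture");
const directCallWroteRow = table.length !== rowsBeforeDirectCall;
```

Both handler calls produce a row, and the direct call produces none. A test written this way would have failed loudly on the agent's run instead of looking like a success.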

The Uncomfortable Pattern

This isn't just an AgentCRM problem. Every team building with AI agents is hitting this same wall, and the pattern shows up everywhere.

In every case, the agent does something plausible and impressive that isn't what was asked. The output looks like progress. The process was violated.

The tell is always the same: you asked for a specific thing, and you got a more ambitious thing that missed the specific thing.

What I'm Building Next

The dispatch system we're designing for AgentCRM separates these concerns explicitly. A workflow definition file specifies the pipeline steps, their inputs and outputs, and the validation between each step. The agent is invoked within a step, given a constrained context, and its output is validated before the next step begins.

The agent never sees the full pipeline. It doesn't know it's Step 4 of 7. It just knows: "Here's a company profile and a list of contacts. Map each contact to a buyer persona. Return a JSON array matching this schema."
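
The "matching this schema" part is what the workflow checks before anything moves on. A minimal sketch of that check, assuming a made-up two-field schema (the real one isn't settled yet) and a hypothetical `parsePersonaMappings` helper:

```typescript
// Hypothetical schema for the persona-mapping step.
interface PersonaMapping {
  contact: string;
  persona: string;
}

// Type guard: true only for objects with string `contact` and `persona` fields.
function isPersonaMapping(value: unknown): value is PersonaMapping {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["contact"] === "string" && typeof record["persona"] === "string";
}

// Parses the agent's raw reply and accepts it only if it is a JSON array
// whose every element matches the schema. Returns null on any mismatch.
function parsePersonaMappings(raw: string): PersonaMapping[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not JSON at all, e.g. a chatty preamble
  }
  if (!Array.isArray(parsed)) return null;
  const items: unknown[] = parsed;
  return items.every((item) => isPersonaMapping(item)) ? (items as PersonaMapping[]) : null;
}

const accepted = parsePersonaMappings('[{"contact":"Jo Lee","persona":"CFO"}]');
const chatty = parsePersonaMappings("Sure! Here are the personas you asked for.");
const wrongShape = parsePersonaMappings('[{"contact":"Jo Lee"}]');
```

Only the first reply gets through. The other two come back as null, and the workflow can retry or halt the step without the agent ever getting a say in what comes next.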

That's the right amount of autonomy. Enough to be useful. Not enough to improvise the process.

The Lesson

If your agent keeps going off-script, the problem isn't the agent. The problem is that you're asking it to follow a script.

Agents don't follow scripts. They pursue goals. Give an agent a goal and a set of tools, and it will find the most impressive path to something that looks like the goal. That path will skip your instrumentation, bypass your tracking, and ignore your fixtures.

The fix is architectural: don't give the agent the script. Give it a step. Let something dumber and more reliable — a workflow engine, a state machine, a for loop — handle the sequencing.

Enthusiasm is a feature when it's pointed at content. It's a bug when it's pointed at process.


Written the day after an agent ran a flawless prospecting campaign that tested nothing it was supposed to test.
