Reasoning Models vs. Execution Models: Stop Using o1 for Agent Workflows
Every week I see another engineer post about running their AI agent on o3, DeepSeek R1, or Claude with extended thinking enabled. They're excited. Their agent got smarter.
Then they hit production and wonder why their costs are 10× what they projected and their agent is slower than their users expected.
I've run the benchmarks. Let me show you the numbers.
What Reasoning Models Actually Do
When you call o1, o3-mini, DeepSeek R1, or Claude Sonnet with thinking mode, the model doesn't just generate an answer. It thinks first.
That thinking happens in a scratchpad — internal tokens you usually don't see (or pay for explicitly). The model reasons through the problem: "What are the constraints? What approaches exist? What are the edge cases? Let me work through this step by step."
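The scale of those hidden tokens matters because they bill at output rates. A minimal cost sketch (all prices and token counts here are illustrative, not any provider's actual rate card):

```python
def run_cost(visible_out, reasoning, in_tokens, price_in, price_out):
    """Reasoning tokens bill at the output rate even though you never see them."""
    billed_out = visible_out + reasoning
    return in_tokens * price_in + billed_out * price_out

# Illustrative per-token prices ($/token) and a typical agent step.
price_in, price_out = 1e-6, 4e-6
no_thinking = run_cost(visible_out=500, reasoning=0, in_tokens=2000,
                       price_in=price_in, price_out=price_out)
with_thinking = run_cost(visible_out=500, reasoning=6000, in_tokens=2000,
                         price_in=price_in, price_out=price_out)
print(f"{with_thinking / no_thinking:.1f}x")  # prints 7.0x
```

A few thousand reasoning tokens can multiply the per-call cost severalfold before the model has produced a single visible character.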
This is genuinely powerful for:
- Complex math and logic problems
- Code debugging where the bug isn't obvious
- System design decisions with real tradeoffs
- Novel situations with no clear playbook
But here's the thing. That thinking costs tokens. Lots of them. Every single time.
What Agent Execution Actually Is
A well-designed agent workflow isn't a novel problem. It's a defined procedure.
Extract this field from the JSON. Check if this value matches the schema. Route to this endpoint if condition A, that endpoint if condition B. Format this data structure. Call these three APIs in sequence.
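Each of those steps is a few lines of deterministic logic. A sketch of one such step (the field names and endpoints are hypothetical):

```python
import json

ENDPOINT_A = "https://api.example.com/a"   # hypothetical endpoints
ENDPOINT_B = "https://api.example.com/b"

def route_payload(raw: str) -> tuple[str, dict]:
    """Extract a field, validate it, pick an endpoint: pure instruction-following."""
    data = json.loads(raw)
    status = data.get("status")
    if status not in {"active", "inactive"}:
        raise ValueError(f"schema violation: status={status!r}")
    endpoint = ENDPOINT_A if status == "active" else ENDPOINT_B
    return endpoint, {"id": data["id"], "status": status}

endpoint, body = route_payload('{"id": 7, "status": "active"}')
```

Nothing in that function benefits from a model thinking about constraints or edge cases first; the constraints are the code.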
These are not hard problems. They're mechanical. They're boring. They require instruction-following, not reasoning.
The Benchmark: Real Numbers, Real Failure
I ran AgentMemo's board intelligence workflow — a multi-step analysis task that processes company data, validates output schema, and structures a report — across three executor models. Same workflow. Same canary validation. Real API calls.
| Model | Type | Canary Pass | Latency | Cost/run | Verdict |
|---|---|---|---|---|---|
| gpt-5-nano | Reasoning | FAIL | 8–13s | ~$0.001+ | ❌ Wrong fit |
| Groq Llama 4 Scout | Fast / Non-reasoning | 0/5 | ~2s | ~$0.001 | ❌ Output truncated |
| gpt-4.1-nano | Non-reasoning | 4/5 | 2–4s | ~$0.0003 | ✅ Winner |
gpt-5-nano isn't a bad model. It's a reasoning model solving a mechanical problem. It spent the first 8–13 seconds of every run thinking about a workflow that was already fully specified. Then it produced output that consistently failed schema validation, because reasoning models optimize for "thoughtful answers" — not precise, rigid, JSON-structured output.
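The "canary" in the table above is nothing exotic: strict validation of the model's output against a required schema. A minimal version of the check (the required fields are illustrative, not AgentMemo's actual schema):

```python
import json

# Hypothetical required fields and types for a structured report.
REQUIRED = {"company": str, "risk_score": (int, float), "findings": list}

def canary_pass(model_output: str) -> bool:
    """Fail if the output isn't valid JSON or any required field is missing/mistyped."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in REQUIRED.items())

# Reasoning models often wrap the JSON in prose, which fails the canary outright.
assert canary_pass('{"company": "Acme", "risk_score": 0.4, "findings": []}')
assert not canary_pass('Sure! Here is the report:\n{"company": "Acme"}')
```

A check this rigid is exactly where "thoughtful" output loses to obedient output.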
The Mental Model: Design-Time vs. Execution-Time Intelligence
Here's the framework that fixes the confusion:
Design-Time: Where Reasoning Belongs
Design-time is when you're figuring out the workflow itself. What steps should the agent take? What are the edge cases? How should it handle failure? What's the output schema?
This is genuinely hard. Novel. Full of tradeoffs. This is where you want a reasoning model. Use Opus. Use o3. Use Claude with max thinking budget. Spend $5–20 figuring out exactly what the agent should do. Document every step, every edge case, every recovery path.
You only pay this cost once per workflow design.
Execution-Time: Where Fast and Cheap Wins
Execution-time is when you're running the workflow you already designed. The agent follows defined steps. It's mechanical. The intelligence is already encoded in the blueprint.
This is where reasoning models destroy your economics. You don't need the model to think. You need it to follow instructions precisely and quickly.
Why the Blueprint IS the Reasoning
The insight that changes how you think about model selection:
When you have a well-designed blueprint, you've already done the hard reasoning. You've stored it. The executing model doesn't need to reason — it just needs to read and follow.
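Concretely, a blueprint can be as simple as the reasoning model's design decisions serialized as data, which a cheap executor then replays step by step. A sketch (the structure is hypothetical):

```python
# Produced once by a reasoning model at design time, then stored.
BLUEPRINT = {
    "steps": [
        {"action": "fetch", "source": "company_api"},
        {"action": "validate", "schema": "report_v1"},
        {"action": "format", "template": "board_report"},
    ],
    "on_failure": "escalate",
}

def execute(blueprint: dict, handlers: dict) -> list:
    """The executor just reads and follows. No reasoning required."""
    results = []
    for step in blueprint["steps"]:
        handler = handlers[step["action"]]   # look up, don't think
        results.append(handler(step))
    return results

log = execute(BLUEPRINT, {
    "fetch": lambda s: f"fetched:{s['source']}",
    "validate": lambda s: f"validated:{s['schema']}",
    "format": lambda s: f"formatted:{s['template']}",
})
```

All the intelligence lives in `BLUEPRINT`; `execute` is deliberately dumb.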
This is exactly how humans operate in mature organizations. A senior engineer thinks deeply to write a runbook. A junior engineer follows that runbook exactly. The junior engineer doesn't need to be a senior engineer — they need to be reliable and fast.
Reasoning models are senior engineers. Non-reasoning models are reliable executors. Both are valuable. Most agent systems use them exactly backwards.
When to Actually Use a Reasoning Model in an Agent System
Reasoning models have three legitimate jobs in agent infrastructure:
1. Blueprint Design (One-Time)
Use Opus or o3 to design the workflow from scratch. Think through every step, every edge case, every failure mode. Document it. Store it. Pay for this intelligence once.
2. Edge Case Patching (Rare)
When a cheap executor hits a case the blueprint didn't anticipate, escalate to a smarter model to write a surgical patch. At AgentMemo, a Claude Sonnet escalation costs ~$0.037 and writes a precise new rule for the blueprint. The executor never fails the same way again.
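That escalation loop is small. A sketch of the control flow (the function names are hypothetical stand-ins, not AgentMemo's API):

```python
def run_with_escalation(task, blueprint, cheap_model, smart_model):
    """Try the cheap executor; on failure, buy one patch and retry."""
    try:
        return cheap_model(task, blueprint)
    except ValueError as failure:
        # One-time spend: the smart model writes a new rule for this case.
        patch = smart_model(task, failure)
        blueprint["rules"].append(patch)      # the blueprint gets smarter
        return cheap_model(task, blueprint)   # the same failure never recurs

blueprint = {"rules": ["default"]}

def cheap(task, bp):
    if task not in bp["rules"]:
        raise ValueError(task)
    return f"ok:{task}"

result = run_with_escalation("edge-case", blueprint, cheap,
                             lambda task, err: task)
```

After one escalation the rule is stored, so every future "edge-case" run succeeds on the cheap path.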
At 1,000 executions: total intelligence cost = ~$0.20 (one-time). Total execution cost = $0.30. Versus running o3 on every execution: $1.50–$5.00+.
That's 3–10× cheaper at this volume, and the gap keeps widening: the one-time intelligence cost amortizes across runs while the blueprint only gets smarter.
3. Novel Decision Points (Exceptional)
Some agent tasks genuinely require reasoning — situations the blueprint couldn't anticipate, judgment calls with major consequences, synthesizing truly novel information. Route these specific steps to a reasoning model. Don't run the whole workflow through one.
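Routing can happen at the step level rather than the workflow level: flag the genuinely novel steps at design time and send only those to the expensive model. A sketch (the `novel` flag and model names are placeholders):

```python
def route_step(step: dict) -> str:
    """Defined steps go to the cheap executor; flagged steps go to a reasoning model."""
    return "reasoning-model" if step.get("novel") else "cheap-executor"

workflow = [
    {"name": "extract_fields"},
    {"name": "assess_unprecedented_merger", "novel": True},
    {"name": "format_report"},
]
assignments = {s["name"]: route_step(s) for s in workflow}
```

One expensive step out of three keeps the reasoning spend proportional to the actual novelty, not to the workflow length.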
The Model Selection Checklist
Before choosing a model for your agent task, answer these questions:
- Is this a defined procedure? → Non-reasoning model. Always.
- Is this a novel problem with real uncertainty? → Reasoning model.
- Will this run more than 10 times? → Non-reasoning model for execution, reasoning model for design.
- Is structured JSON output required? → Non-reasoning model. Reasoning models optimize for thoughtful prose, not rigid structure.
- Is low latency important? → Non-reasoning model. Reasoning adds 5–30s of thinking time.
- Is this a one-time architectural decision? → Reasoning model.
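The checklist collapses into a small decision function. A sketch, treating each question as a boolean flag:

```python
def pick_model(defined_procedure: bool, runs_often: bool,
               needs_json: bool, latency_sensitive: bool) -> str:
    """Default to the cheap non-reasoning executor; reserve reasoning
    for novel, one-off, unstructured, latency-tolerant work."""
    if defined_procedure or needs_json or latency_sensitive or runs_often:
        return "non-reasoning"
    return "reasoning"

# A repeated, JSON-producing agent step: always the cheap executor.
assert pick_model(True, True, True, True) == "non-reasoning"
# A one-time architectural decision: the reasoning model.
assert pick_model(False, False, False, False) == "reasoning"
```

The asymmetry is deliberate: any single "yes" on the execution-shaped questions pushes you to the cheap model, because that is the common case.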
The Right Architecture
The pattern that works at scale: reason once at design time, execute cheaply at run time, escalate rarely.
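As a cost model, the pattern looks like this. The constants echo figures quoted earlier in the post (the $0.037 escalation, the $0.0003 execution cost, the $5 low end of the design budget, and the low end of the per-run o3 range); treat them as illustrative:

```python
DESIGN_COST = 5.00        # one-time reasoning spend at design time
PATCH_COST = 0.037        # per escalation (Sonnet patch figure above)
EXEC_COST = 0.0003        # per run on the cheap executor
REASONING_EXEC = 0.0015   # per run if a reasoning model executed everything

def total(runs: int, escalations: int) -> tuple[float, float]:
    """Total spend: blueprint pattern vs. reasoning model on every run."""
    blueprint_way = DESIGN_COST + escalations * PATCH_COST + runs * EXEC_COST
    reasoning_way = runs * REASONING_EXEC
    return blueprint_way, reasoning_way

blueprint_total, reasoning_total = total(runs=100_000, escalations=10)
```

The fixed design cost dominates at low volume, then amortizes to nothing; the per-run reasoning cost never amortizes at all.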
This is what AgentMemo is built for. The intelligence compounds in the blueprint. The execution stays cheap. You never pay for the same reasoning twice.
Build Agent Workflows That Don't Burn Your Budget
AgentMemo gives your agents persistent blueprints, smart escalation, and validated execution — designed once, running forever on the cheapest model that works.
Start Free →