Most engineering teams that adopt AI coding tools go through the same phase. They pick a model, usually whatever's at the top of the benchmarks that month, and use it for everything. Planning, writing code, reviewing what they just wrote. One model, all phases, done.
It's the default. And the best teams have quietly moved past it.
Not because of any particular belief in one model over another, but because they've noticed what shows up in the numbers. Costs compound faster than expected. Reviews catch less than they should. And something subtle happens to output quality in long sessions that's hard to point to but very real when you see it. Once you understand why these things happen, the fix is pretty obvious.
Think about what a real feature task involves. There's a planning phase where you're making architectural decisions: data models, API shapes, how this fits into the existing codebase. Then there's implementation, which is mostly execution: writing handlers, tests, schemas, the connective tissue between things. Then review.
These phases have almost nothing in common in terms of what they need from an AI.
Planning needs real reasoning capability. Frontier models like Claude Opus or GPT-5.4 genuinely earn their price tag here because architectural judgment is exactly what they're optimized for. Implementation is different. Once you have a solid spec, implementation is mostly execution against known requirements. Writing a unit test for a function you've already defined is not a hard reasoning task. It's a fast, accurate, repetitive one.
Frontier models cost an order of magnitude more per token than fast models like Gemini Flash or Claude Sonnet. And in a typical feature task, implementation accounts for roughly 70% of all tokens used. Planning is around 15%. Review is the remaining 15%.
| Phase | Share of tokens | What it actually needs |
|---|---|---|
| Planning | ~15% | Architectural reasoning, judgment calls |
| Implementation | ~70% | Speed, accurate execution against a spec |
| Review | ~15% | A perspective that isn't the author's |
If you're running a frontier model through all three phases, you're paying frontier prices for the phase where the reasoning advantage matters least. A fast model working from a well-defined spec produces implementation that's essentially indistinguishable from a frontier model working without one, because the hard thinking already happened upstream.
The blended efficiency difference between one-model-for-everything and matching the right model to each phase is large enough to notice on your bill at the end of the month.
This pattern was independently documented by Harper Reed, whose multi-model workflow (brainstorm with a conversational model, plan with a reasoning model, execute with an agentic tool) became one of the most widely circulated posts in the AI engineering community in early 2025. Different model for planning, different model for execution. The spec artifact as the handoff mechanism. Engineers were landing on this pattern independently, before any tooling made it easy.
The cost issue is visible in your billing. The review problem is less obvious, which is part of why it persists.
When a model reviews code it wrote, it tends to retrace the same reasoning paths it followed the first time. The edge case it missed during implementation? It often misses it again. Not because it's incapable of finding it, but because it's not approaching the problem fresh. It already made decisions about how this code should work, and it's reading the implementation through that lens.
Andrej Karpathy put it well in a January 2026 X post after spending a lot of time with AI coding agents:
The mistakes have changed a lot. They are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking.
This is what makes self-review structurally weak. The model that produced those wrong assumptions is the same one being asked to catch them. Different models, trained on different data with different architectures, develop different blind spots. A model that didn't write the code brings structurally different assumptions to the review. And when you run two or three independent models against the same diff, the convergent signal is meaningfully stronger than any single reviewer. Independent agreement is evidence in a way that one opinion isn't.
This one is hard to measure but very consistent once you've seen it.
A model that spent 40 minutes exploring architectural approaches, rejecting some, working through edge cases, carrying on a planning discussion. All of that is still in context when implementation starts. Context windows have limits, and the more noise accumulated before implementation begins, the more the model has to navigate around it. Generation quality drifts in ways that don't show up as obvious failures. The code looks fine. The tests pass. But something is off about how the pieces fit together, and it takes a code review to surface it.
Karpathy identified this as the central engineering problem in agentic AI systems, calling it "context engineering" rather than prompt engineering: the discipline of filling the context window with just the right information for the next step.
The fix is a clean context for each phase. Not a longer context window.
Subagents solve this. An isolated sub-process with a fresh context window receives only what it actually needs: the implementer gets the spec, the reviewer gets the diff. Neither carries the conversation history of the previous phase. The model starts each job without baggage.
The pattern that shows up across teams that have figured this out is consistent: they match model to phase, and they write it down so it's automatic.
Zenflow builds this directly into its workflow engine. You create named model configurations in Settings (opus-planner, flash-builder, sonnet-reviewer) and bind them to workflow steps with a comment in your workflow file.
### [ ] Step: Planning
<!-- agent: opus-planner -->
Analyze the requirements. Produce spec.md covering architecture,
affected files, edge cases, and verification criteria.
### [ ] Step: Implementation
<!-- agent: flash-builder -->
Implement the changes described in spec.md. Follow existing patterns.
### [ ] Step: Review
<!-- agent: sonnet-reviewer -->
Review all changes against spec.md. Score on semantic resolution,
contract adherence, regression safety, and scope discipline.
Record findings in review.md.
Zenflow switches to the assigned model at each step boundary. What keeps the different models aligned isn't shared chat history. It's the spec artifact that the planner writes to disk and every downstream phase reads from. Same mechanic Harper Reed documented, now automated.
For higher-stakes changes, you can push further: spawn three isolated review workers in parallel, each using a different model, with an orchestrator that consolidates the findings into a single report.
Review Orchestrator
├── Worker 1: GPT-5.3 Codex → review_worker_1.json
├── Worker 2: GPT-5.4 Think → review_worker_2.json
└── Worker 3: Gemini 3.5 Flash → review_worker_3.json
↓
final_review.md
If you don't want to configure a full pipeline, running /comprehensive-review from chat triggers this same parallel review pattern on demand.
Everything above applies within a single task. But the same logic extends to running multiple tasks simultaneously. We covered the senior engineer's version of this in The Senior Engineer's Multi-Model Workflow: Spec Once, Ship Three, but the basics matter here too.
Each Zenflow task runs in its own Git worktree at .zenflow/worktrees/{task_id}. Agents on different tasks are working in entirely separate directories, so there's no file contention and no branch collisions while work is in progress.
.zenflow/worktrees/
├── task-abc123/ ← flash-builder implementing auth
├── task-def456/ ← opus-planner spec'ing the payments API
└── task-ghi789/ ← sonnet-reviewer auditing the data layer
A small team can run several tasks concurrently, each configured with the right model for that type of work, and review them as they finish. This is exactly what Harper Reed was asking for at the end of his February 2025 post:
I have spent years coding by myself, years coding as a pair, and years coding in a team. It is always better with people. These workflows are not easy to use as a team. The bots collide, the merges are horrific, the context complicated. I really want someone to solve this problem in a way that makes coding with an LLM a multiplayer game.
Harper Reed · February 2025
The Multi-model workflow is available in Zenflow from the workflow picker when you create any code task. The default configuration already applies a frontier model to planning and a fast model to implementation, so most of the efficiency benefit is there without extra setup.
From there, a reasonable path forward:
Set up agent presets for the two or three model pairings you care most about. Start with a planner and a builder. The planning-to-fast-model handoff is where the majority of efficiency gains are.
Try /comprehensive-review on your next substantial PR before requesting a human review. The gap between what it surfaces and what a same-model review catches tends to be instructive the first time you run it.
Once you've found a combination that works well for a class of tasks, save it as a custom workflow template. The whole point of getting the model selection right is that it should be automatic, not something you reconfigure on every task.
The teams getting the most out of AI coding tools aren't necessarily the ones with the biggest model budgets. They're the ones who've thought carefully about what each phase of a task actually needs, and matched accordingly.
A multi-model workflow assigns different AI models to different phases of a coding task. A frontier reasoning model handles planning. A fast model handles implementation. A different model reviews. The phases are coordinated by a spec artifact every model reads from.
Different phases need different things. Planning needs reasoning depth. Implementation needs speed and accurate execution against a defined spec. Review needs a perspective that didn't write the code. Using one frontier model for all three is expensive for implementation and weak for review.
Cross-model review is code review where the reviewer model is from a different model family than the model that wrote the code. Different families have different blind spots, so a different-family reviewer catches different classes of issues. Running multiple reviewers in parallel and looking for convergent findings gives stronger signal than a single reviewer.
Zenflow lets you create named agent presets in Settings, each tied to a specific model. You bind presets to workflow steps with an HTML comment in the workflow file. Zenflow switches models at each step boundary. The spec artifact written by the planner is automatically available to the implementer and reviewer.
Skip it for small, well-scoped tasks where setup overhead would dominate. Use it for feature work, recurring task types, and any change important enough that you want a structured cross-model review before merging.