AI is amazing on day one.
By day ten, reality shows up. The model "forgets" your conventions. It starts changing unrelated files. It breaks tests it didn't touch. Your carefully crafted codebase gradually degrades into what teams politely call "drift" and privately call "slop."
You know the pattern: Week one feels like magic. Week four feels like babysitting. By week eight, you're wondering if AI is actually making things worse.
Here's what most teams get wrong: they think they have an "AI quality problem."
They don't.
They have a reliability system problem.
AI isn't like a junior developer who learns from corrections. It's stateless. Every interaction starts fresh. Without a system—consistent inputs, persistent rules, verification loops, and reviewable artifacts—quality degrades with every iteration.
The good news? Reliability is fixable. You don't need better prompts or smarter models. You need a checklist that keeps work repeatable: same inputs, same rules, same verification, same artifacts.
This is that checklist.
1. Lock the Inputs (don't make the model guess)
Checklist:
- Always start from a task brief (one paragraph) with a definition of done
- Include constraints: what must not change (API shape, DB schema, dependencies)
- Provide examples when possible (sample request/response, edge cases, expected output)
- Reference the exact files or modules when you know them
- Keep a "don't touch" list (generated code, vendor dirs, critical modules)
Why this matters: AI output becomes unstable when the model has to invent missing details. Every gap creates variability. The goal is to reduce ambiguity, not to write a longer prompt.
Practical tip: A good brief is often just: goal + constraints + acceptance criteria + test expectations.
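To make the brief itself repeatable, some teams keep it as structured data rather than free text, so every run starts from the same fields. A minimal sketch (the field names here are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class TaskBrief:
    """One-paragraph brief the model receives with every task."""
    goal: str                          # what we want, in one or two sentences
    definition_of_done: list[str]      # acceptance criteria, each independently checkable
    constraints: list[str]             # what must not change (API shape, schema, deps)
    examples: list[str] = field(default_factory=list)    # sample requests/responses, edge cases
    files_in_scope: list[str] = field(default_factory=list)
    dont_touch: list[str] = field(default_factory=list)  # generated code, vendor dirs, etc.

    def render(self) -> str:
        """Flatten the brief into the prompt text the agent actually sees."""
        sections = [
            f"Goal: {self.goal}",
            "Done when:\n" + "\n".join(f"- {c}" for c in self.definition_of_done),
            "Constraints:\n" + "\n".join(f"- {c}" for c in self.constraints),
        ]
        if self.examples:
            sections.append("Examples:\n" + "\n".join(f"- {e}" for e in self.examples))
        if self.files_in_scope:
            sections.append("Relevant files: " + ", ".join(self.files_in_scope))
        if self.dont_touch:
            sections.append("Do not touch: " + ", ".join(self.dont_touch))
        return "\n\n".join(sections)
```

The point isn't the dataclass; it's that the same slots get filled for every task instead of being re-invented in chat.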
2. Make Rules Explicit (and reusable)
Checklist:
- Store your coding rules in versioned files (not in someone's head)
- Keep rules short and enforceable: logging, error handling, naming, patterns
- Add "never do" rules (no new deps, no schema changes, no refactors)
- Auto-include these rules in AI context for every task
- Update rules when reviews repeat the same feedback
Why this matters: If rules aren't persistent, you'll re-teach the model every time. That's where drift comes from—each run becomes a new negotiation.
Andrej Karpathy nailed it: "Context engineering" beats "prompt engineering." Industrial-strength AI apps succeed through the delicate art of filling the context window with just the right information—task descriptions, examples, relevant data, tools, state, and history. Too little context and performance suffers. Too much and costs explode while quality drops.
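One way to make rules persistent in practice: keep them as versioned files in the repo and prepend them to every task automatically, so no run starts without them. A rough sketch, assuming a rules/ directory you maintain:

```python
from pathlib import Path

RULES_DIR = Path("rules")  # versioned with the code, e.g. rules/logging.md, rules/never-do.md

def load_rules() -> str:
    """Concatenate all rule files so the agent sees the same rules on every run."""
    parts = []
    for rule_file in sorted(RULES_DIR.glob("*.md")):
        parts.append(f"## {rule_file.stem}\n{rule_file.read_text().strip()}")
    return "\n\n".join(parts)

def build_context(task_brief: str) -> str:
    """Rules first, then the task: the model never negotiates them from scratch."""
    return f"{load_rules()}\n\n# Task\n{task_brief}"
```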
3. Use a Spec or Plan Step Before Coding
Checklist:
- Generate/maintain a requirements doc (even a short one)
- Generate a technical plan for multi-file changes
- Confirm plan matches constraints before editing code
- Break work into steps that can be verified independently
- Don't let implementation start until the plan exists
Why this matters: A plan is your first anti-drift mechanism. It makes work sequential and reviewable. Without it, agents improvise.
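The gate can be as blunt as "no plan file, no implementation." A minimal sketch of that check, assuming each task has a directory with a plan.md in it:

```python
from pathlib import Path

class MissingPlanError(RuntimeError):
    pass

def require_plan(task_dir: Path) -> str:
    """Refuse to start implementation until a plan exists and has content."""
    plan_path = task_dir / "plan.md"
    if not plan_path.exists() or not plan_path.read_text().strip():
        raise MissingPlanError(
            f"No plan found at {plan_path}. Write and confirm the plan before any code is edited."
        )
    return plan_path.read_text()
```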
4. Keep Diffs Small on Purpose
Checklist:
- Require "scope discipline": only change what the task needs
- Avoid drive-by refactors unless explicitly asked
- Set a soft limit on files changed (or require justification)
- If a change touches many layers, split into multiple PRs/steps
- Ask for a diff summary: "what changed + where + why"
Why this matters: Large diffs don't just slow reviews—they hide mistakes. Smaller diffs make reliability visible.
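A soft limit doesn't need special tooling; a small script in CI or a pre-commit hook can flag oversized diffs for justification. A sketch using git, with the threshold as an assumption you'd tune:

```python
import subprocess

MAX_FILES = 10  # soft limit: beyond this, require an explicit justification in the PR

def changed_files(base: str = "origin/main") -> list[str]:
    """List files touched relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    files = changed_files()
    if len(files) > MAX_FILES:
        print(f"{len(files)} files changed (soft limit {MAX_FILES}). Split the change or justify the scope:")
        for f in files:
            print(f"  {f}")
        raise SystemExit(1)
```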
5. Always Run Verification (no exceptions)
Checklist:
- Unit tests run for every change
- Lint/format/typecheck run automatically
- Build runs when relevant
- CI must be green before merge
- Fail the workflow if checks fail (don't "ship and hope")
Why this matters: Verification is the reliability engine. Without it, output quality depends on luck and reviewer attention.
Practical tip: Don't chase global perfection. Enforce checks on touched files/modules first. That's where reliability starts.
As Itamar Friedman pointed out, the percentage of time spent debugging has gone up massively with AI coding. Karl Weinmeister added: "You still need code quality checks in the era of AI. Different prompts, context, and models lead to different outcomes."
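A sketch of "enforce checks on touched files first": run lint, typecheck, and tests against the files in the diff and fail the workflow if any step fails. The tool names (ruff, mypy, pytest) are stand-ins for whatever your stack already uses:

```python
import subprocess
import sys

def run(cmd: list[str]) -> bool:
    print("$", " ".join(cmd))
    return subprocess.run(cmd).returncode == 0

def changed_python_files(base: str = "origin/main") -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f.endswith(".py")]

def main() -> int:
    files = changed_python_files()
    if not files:
        return 0
    checks = [
        ["ruff", "check", *files],   # lint the touched files
        ["mypy", *files],            # typecheck the touched files
        ["pytest", "--quiet"],       # run the tests; narrow the selection later if it gets slow
    ]
    failed = [check[0] for check in checks if not run(check)]
    if failed:
        print("Verification failed:", ", ".join(failed))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```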
6. Add a Second Pass: Review Agent or "Critic" Step
Checklist:
- Run a reviewer pass after implementation
- Reviewer focuses on: missing tests, edge cases, security, performance
- Keep feedback high-signal: top 5 risks, not 50 nits
- Require the builder to address reviewer findings or explain why not
- Use different model(s) for review when possible
Why this matters: A single model can be confidently wrong. A second pass catches blind spots and stabilizes output quality across runs.
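The reviewer pass can be a plain second call with a narrower prompt. The sketch below leaves call_model as a placeholder for whatever client you already use, ideally pointed at a different model than the builder:

```python
REVIEW_PROMPT = """You are reviewing a change, not rewriting it.
Given the diff and the task brief, list the top 5 risks only, ordered by severity.
Focus on: missing tests, unhandled edge cases, security, performance.
For each risk: one sentence on the problem, one on the suggested fix. No style nits."""

def call_model(system: str, user: str, model: str) -> str:
    """Placeholder: wire this to the LLM client you already have."""
    raise NotImplementedError

def review_change(diff: str, brief: str, reviewer_model: str = "some-other-model") -> str:
    """Second pass, ideally with a different model, to catch the builder's blind spots."""
    return call_model(
        system=REVIEW_PROMPT,
        user=f"# Task brief\n{brief}\n\n# Diff\n{diff}",
        model=reviewer_model,
    )
```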
7. Make Artifacts Mandatory (so every run is explainable)
Checklist:
- Every task produces an artifact: plan/spec doc, PR description, or test report summary
- Include "what changed, why, how to verify" in the artifact
- Store artifacts alongside code (or in your task system)
- Avoid "chat-only" work that disappears
Why this matters: Artifacts make output reviewable and repeatable. They also make "why did this happen?" answerable.
Anthropic's context engineering guide emphasizes this: to get the most out of AI agents, you need more than prompts—you need persistent, structured context that agents can reliably reference across iterations.
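A sketch of the artifact habit: every completed task writes one small file with what changed, why, and how to verify, stored alongside the code. The docs/changes layout here is an assumption; a task system works just as well:

```python
from datetime import date
from pathlib import Path

def write_artifact(task_id: str, what: str, why: str, how_to_verify: str,
                   out_dir: Path = Path("docs/changes")) -> Path:
    """Persist the run's summary next to the code so it survives the chat session."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{date.today().isoformat()}-{task_id}.md"
    path.write_text(
        f"# {task_id}\n\n"
        f"## What changed\n{what}\n\n"
        f"## Why\n{why}\n\n"
        f"## How to verify\n{how_to_verify}\n"
    )
    return path
```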
8. Handle Iteration Safely (the "change request" loop)
This is where most AI setups break down. The agent makes a change, review asks for tweaks, the agent "fixes it," and suddenly unrelated code changes appear.
Checklist:
- When applying review feedback, restrict context to the PR + relevant files
- Repeat constraints ("don't change API", "no refactor") in the change request
- Re-run verification after every iteration
- Require a short delta summary: "I changed X to address comment Y"
- If scope creeps, split into a new task
Why this matters: Iteration is where drift is born. Tight context + repeated checks keep output stable.
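A sketch of a disciplined change-request step: the follow-up prompt is rebuilt from only the PR's files, the specific review comment, and the original constraints repeated verbatim. Function and field names here are illustrative:

```python
from pathlib import Path

def build_change_request(comment: str, constraints: list[str], pr_files: list[str]) -> str:
    """Tight context for an iteration: the comment, the standing constraints, the PR files. Nothing else."""
    file_sections = "\n\n".join(
        f"--- {f} ---\n{Path(f).read_text()}" for f in pr_files
    )
    constraint_text = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Apply this review comment with the smallest possible change:\n{comment}\n\n"
        f"Constraints (still in force):\n{constraint_text}\n\n"
        f"Only these files may change:\n{file_sections}\n\n"
        "After the change, summarize the delta as: 'I changed X to address comment Y.'"
    )
```

Re-running the verification step after every iteration is what keeps this loop honest.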
9. Watch for Reliability Smells (catch drift early)
Warning signs:
- PR changes unrelated files "because it seemed cleaner"
- Tests get skipped or replaced with weaker ones
- The agent rewrites working code instead of making minimal edits
- Error handling/logging style changes randomly
- Output feels different week to week without a reason
What to do: When you see these, don't just correct the PR—update your rules or workflow so it doesn't repeat.
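Some of these smells can be caught mechanically. A sketch that flags files changed outside the task's declared scope, assuming the scope list comes from the task brief:

```python
import subprocess

def out_of_scope_changes(declared_scope: list[str], base: str = "origin/main") -> list[str]:
    """Files touched by the PR that were never part of the task's declared scope."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    changed = [f for f in out.stdout.splitlines() if f.strip()]
    return [f for f in changed if not any(f.startswith(prefix) for prefix in declared_scope)]

# Example: anything outside src/billing/ or its tests gets flagged for a human look.
# print(out_of_scope_changes(["src/billing/", "tests/billing/"]))
```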
10. Measure Reliability (so you can improve it)
You don't need fancy dashboards. Track a few simple things:
Metrics to watch:
- PR cycle time (open → merge)
- CI time-to-green (first fail → pass)
- Review rounds per PR
- Change failure rate (hotfixes/rollbacks/incidents)
- "Rework commits" (commits that only fix earlier AI mistakes)
Why this matters: Reliability is about trends. If review rounds drop and CI stabilizes, your system is working.
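None of this needs a dashboard; a few dozen lines against your Git host's API, or a CSV export, is enough. A sketch over a plain list of PR records, with the field names as assumptions:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: datetime
    review_rounds: int    # number of "changes requested" cycles
    rework_commits: int   # commits that only fix earlier AI mistakes

def report(prs: list[PullRequest]) -> dict[str, float]:
    """Trend these week over week; the direction matters more than the absolute numbers."""
    return {
        "median_cycle_time_hours": median(
            (pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs
        ),
        "avg_review_rounds": sum(pr.review_rounds for pr in prs) / len(prs),
        "avg_rework_commits": sum(pr.rework_commits for pr in prs) / len(prs),
    }
```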
AI output becomes stable when you stop treating every run like a conversation and start treating it like a pipeline.
Reliable teams don't rely on "better prompts." They rely on:
- Consistent inputs
- Persistent rules
- Small diffs
- Verification
- Second-pass review
- Durable artifacts
That's how you keep AI output stable across iterations—without slowing developers down.