Iteration one feels like magic. You describe a feature, the model delivers something believable, and you think: "We're so back."
Iteration two is usually fine. A tweak here, an edge case there.
Then iteration three hits.
The model changes unrelated files. Tests get weaker. Naming shifts. It "cleans up" working code. The diff becomes impossible to review. You spend more time steering than building—and the productivity gains evaporate.
This pattern has a name: prompt roulette.
It's what happens when shipping depends on how a model feels about your request in that moment, rather than on a stable process. And it's why so many teams say, "AI is great... until it isn't."
Most people blame the model when things go sideways. "It got worse." "It forgot my instructions." "Today it's dumb."
Sometimes models do change. But the more common reason is simpler: each iteration changes the inputs in subtle ways.
By iteration three, your AI is operating with stale conversation history, layered and partly contradictory constraints, and its own earlier guesses treated as facts.
Humans handle messy, evolving context. LLMs recompute the best next answer from whatever you've given them—every single time.
They don't maintain a stable internal "truth." So iteration three isn't a mystery. It's the point where entropy wins.
Research backs this up: Rohan Paul shared a study showing GPT-5 shifted its stated beliefs by 54.7% after just 10 debate rounds. Beliefs and actions drift from context alone, not just from attacks.
Iteration one: "Add input validation to registration."
Simple. Minimal decisions.
Iteration two: "Also handle empty emails and update error messages."
Still manageable.
Iteration three: "Don't change API responses, keep backwards compatibility, don't add dependencies, update tests, and also can you clean up this function?"
Now the model has to juggle behavior constraints, compatibility rules, style preferences, scope limits, test requirements, and refactor temptation.
Unless you've set hard guardrails, it will choose the path that sounds correct rather than the one that is safest.
This is where AI stops acting like a careful teammate and starts acting like a well-meaning intern with too much confidence.
As Itamar Friedman noted, there's a difference between "vibe coding" (prototypes, weekend hacks) and AI-assisted coding you professionally care about. Prompt roulette is what happens when you treat production code like a vibe project.
Beyond wasted time, the costs compound: diffs nobody wants to review, tests that quietly get weaker, and teammates who stop trusting the output.
This is why teams plateau: AI is useful for quick wins, but hard to trust for sustained work.
The fix isn't "write better prompts." The fix is to move from prompting to process.
Andrej Karpathy shared his rhythm for the AI-assisted coding he actually cares about: stuff relevant context, describe single concrete changes, ask for approaches first (not code), review carefully, test, commit. The emphasis is on "keeping a very tight leash on this new over-eager junior intern savant." OpenAI also released a 28-page guide on context engineering: when to trim, when to summarize, and how to prevent drift.
Here's the practical implementation:
Instead of relying on conversation context, use durable documents: a requirements spec and an explicit list of constraints that every iteration can point back to.
When iteration three comes, don't say "as we discussed above." Point to a stable spec: "Follow requirements.md and do not change API output shape."
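To make that concrete, here is a minimal Python sketch of what "point to a stable spec" can look like in tooling: every iteration rebuilds its prompt from requirements.md instead of referring back to earlier turns. The helper name and exact wording are illustrative, not a real library API.

```python
from pathlib import Path

def build_prompt(task: str, spec_path: str = "requirements.md") -> str:
    """Rebuild the prompt from the same durable spec on every iteration,
    instead of relying on 'as we discussed above'."""
    spec = Path(spec_path).read_text()
    return (
        "Follow the spec below exactly. Do not change the API output shape.\n\n"
        f"--- SPEC ---\n{spec}\n--- END SPEC ---\n\n"
        f"Task: {task}"
    )

# Iteration three points at the same spec as iteration one:
prompt = build_prompt("Also handle empty emails and update error messages.")
```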
Create a standard "must not change" list: API response shapes, public interfaces, dependencies, existing test assertions, files outside the task's scope.
Put it in a reusable rule file. Include it automatically.
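Whether the rules live in a file your agent loads automatically or in tooling you control, the effect is the same: the constraints ride along on every request instead of living in your memory. A sketch, with illustrative rule wording drawn from the iteration-three example above:

```python
# Reusable "must not change" rules, appended to every request so scope
# limits never depend on conversational memory.
GUARDRAILS = [
    "Do not change API response shapes.",
    "Do not add new dependencies.",
    "Do not weaken or delete existing tests.",
    "Do not touch files outside the stated task.",
]

def with_guardrails(prompt: str) -> str:
    rules = "\n".join(f"- {rule}" for rule in GUARDRAILS)
    return f"{prompt}\n\nHard constraints (non-negotiable):\n{rules}"
```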
Before it writes any code, require the agent to state which files it will touch, what it will leave alone, and the approach it intends to take.
This takes 30 seconds and prevents 30 minutes of cleanup.
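One way to enforce this is a two-phase loop: the first call asks for a plan only, the second asks for the implementation of the approved plan. In the sketch below, call_model is a placeholder for whatever client or agent API you actually use, not a real library call.

```python
def plan_then_implement(task: str, call_model) -> str:
    """Phase 1: ask for a stated plan only. Phase 2: implement the approved plan.
    `call_model` is a stand-in for your actual model client."""
    plan = call_model(
        "Before writing any code, state: which files you will touch, "
        "what you will leave unchanged, and your approach.\n\n"
        f"Task: {task}"
    )
    print(plan)
    if input("Approve this plan? [y/N] ").strip().lower() != "y":
        raise SystemExit("Plan rejected; refine the task and try again.")
    return call_model(
        f"Implement exactly the approved plan below, nothing more.\n\n{plan}"
    )
```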
Every iteration (not just the final one), run the same automated checks: tests, linters, type checks.
This turns "I think it works" into "it passed."
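A gate can be as small as a script that refuses to let an iteration land until the usual checks pass. The specific tools below (pytest, ruff, mypy) are assumptions; substitute whatever your project already runs.

```python
import subprocess
import sys

# Example checks; swap in your project's actual test, lint, and type tools.
CHECKS = [
    ["pytest", "-q"],
    ["ruff", "check", "."],
    ["mypy", "."],
]

def verification_gate() -> None:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"Gate failed on: {' '.join(cmd)}")
    print("All checks passed.")

if __name__ == "__main__":
    verification_gate()
```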
A reviewer agent catches blind spots: scope creep, weakened assertions, unintended changes to behavior you said was off-limits.
Even better if it's a different model family. Different models fail differently.
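Here is a sketch of what such a reviewer pass can look like: feed the staged diff plus the guardrails to a second model and ask for violations only. call_review_model is a placeholder for whichever second-opinion model you wire in, ideally from a different family.

```python
import subprocess

def review_diff(call_review_model) -> str:
    """Ask a second model to audit the staged diff against the guardrails.
    `call_review_model` is a stand-in for your reviewer's client."""
    diff = subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True
    ).stdout
    prompt = (
        "Review this diff strictly for: scope creep, weakened or deleted tests, "
        "changed API response shapes, and new dependencies. List violations only.\n\n"
        f"{diff}"
    )
    return call_review_model(prompt)
```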
If you want AI to be stable across iterations, treat it like a production pipeline: stable inputs, explicit constraints, automated verification, and independent review.
That's what "AI-first engineering" actually looks like. Not a bigger prompt—just a better system.
The best teams aren't chasing an AI that's magically correct.
They're building a workflow where context is durable, constraints are explicit, verification is automatic, and review doesn't hinge on any one model's mood.
Because speed isn't the hard part anymore. Reliability is.
And prompt roulette is what happens when you don't design for it.