The market has been flooded with new models lately. In February alone, the Big Three released updates to their models, along with several major OSS labs. Each of them claims success, and it's quite difficult to tell who's actually right. At the same time, despite this abundance of models, the set of agentic benchmarks available for them is surprisingly scarce; you can count the notable ones released in 2026 on one hand: Multi-SWE-Bench, RE-Bench, SWE-Bench Pro, SWE-Lancer (plus a few others with a slightly different focus). It's gotten to the point where some of the largest players are still reporting results on SWE-Bench Verified, a benchmark that has been essentially solved for almost a year now and only covers Python.
After hitting 70% on SWE-Bench Verified last year (the top result at the time, which held for about two weeks), we stepped back from benchmarks. 2025 was the year of agents. Everyone was setting up various agents and learning to work with them. Automated benchmarks made less sense in those scenarios: almost nobody ran agents completely unsupervised, and the best agent was the one that understood your instructions and did what you wanted through close collaboration.
However, progress turned out to be faster than everyone expected. 2026 is clearly going to be the year of orchestration and automation, making the autonomous task execution setup relevant again. In the recent batch, we ran numerous experiments, spending roughly $50,000 in total (including retries and failed runs). Part of it was our private evals, and part was open benchmarks. For the latter, we decided to share the results with the community. We're releasing a series of blog posts to share what we've learned.
In this first post, we look at the simplest setup: take a provider, take its CLI, run models on it, and measure the results. We fixed the major bugs related to benchmark setup ourselves (for example, some Go benchmark images use musl, and CLIs don't always install cleanly on them), but we mostly left provider-specific issues alone.
We chose SWE-Bench Pro as our benchmark. Our setup differs from the standard SWE-Bench Pro evaluation in two key ways: the official setup uses a unified harness, whereas we used each provider's native CLI; and we ran tasks with Harbor, an open-source tool that manages SWE-Bench execution locally, rather than the official SWE-Bench Pro evaluation harness. (We also follow the work of the team behind RE-Bench, but decided against using it due to its small number of tasks and Python-only language coverage.)
For open-source and open-weights models, we used a shared harness and ran inference through Fireworks.
With that, let's wrap up the introduction and move on to the review.
SWE-Bench Pro Standard Results
First, here are the standard results, scored with the default SWE-Bench Pro setup, which includes detailed task instructions specifying interface names, class structures, and variable naming conventions.
| # | Model | Score | Avg Cost | Avg Time |
|---|---|---|---|---|
| 1 | claude-opus-4-6 | 52.7% ± 1.15% | $1.35 | 9:30 |
| 2 | claude-opus-4-5 | 51.5% ± 1.41% | $1.41 | 8:30 |
| 3 | openai/gpt-5.4 | 51.3% ± 1.33% | $1.04 | 8:24 |
| 4 | claude-sonnet-4-6 | 50.7% ± 1.44% | $0.69 | 7:46 |
| 5 | google/gemini-3.1-pro (custom tools) | 50.0% ± 1.74% | $0.72 | 11:24 |
| 6 | google/gemini-3.1-pro | 49.9% ± 1.83% | $1.00 | 13:21 |
| 7 | google/gemini-3-pro | 49.6% ± 2.05% | $0.92 | 11:29 |
| 8 | openai/gpt-5.3-codex | 49.6% ± 1.36% | $0.71 | 7:25 |
| 9 | glm-5 | 47.5% ± 3.04% | $0.56 | 8:41 |
| 10 | openai/gpt-5.2 | 47.2% ± 1.54% | $1.02 | 14:21 |
| 11 | openai/gpt-5.2-codex | 46.9% ± 1.58% | $0.70 | 8:36 |
| 12 | google/gemini-3-flash | 45.9% ± 1.66% | $0.32 | 7:03 |
| 13 | claude-sonnet-4-5 | 45.5% ± 1.61% | $0.82 | 10:51 |
| 14 | kimi-k2.5 | 44.8% ± 2.47% | $0.27 | 8:23 |
| 15 | minimax-2.5 | 41.1% ± 2.22% | $0.10 | 5:28 |
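A note on the ± column: assuming these bands are standard binomial confidence intervals over pass/fail task outcomes (our assumption; the exact methodology isn't specified here), they can be sketched as:

```python
import math

def resolve_rate_ci(solved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Resolve rate (in %) and the half-width of a normal-approximation 95% CI."""
    p = solved / total
    half_width = z * math.sqrt(p * (1 - p) / total)  # binomial standard error
    return round(p * 100, 1), round(half_width * 100, 2)

# Example: 385 of the 730 tasks solved in a single run
rate, ci = resolve_rate_ci(385, 730)  # ~52.7%, half-width ~3.6 points
```

On 730 tasks, a single-run 52.7% resolve rate would carry a half-width of roughly ±3.6 points; the tighter bands in the table suggest scores were averaged over multiple runs per task.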
These numbers are… boring. Every model is within a narrow band. Each generation is slightly better than the last — a bit cheaper, a bit faster, a bit more capable, but the differences are marginal. If you squint, all frontier models look roughly the same.
This is, in some sense, logical. We're in an era of knowledge diffusion: all labs train on similar data distributions, produce similar pairs of (PR, task description), and converge on similar capabilities. When you give a model a perfectly specified task — with exact class names, interface definitions, and variable naming conventions — it's almost a translation exercise. Of course they all do it about equally well.
But are they really the same? Not quite. And before we look under the hood to see how they actually differ, let me tell you about a $20,000 mistake that changed how we view these models.
The $20,000 Bug
No benchmark is perfect, and SWE-Bench Pro is no exception. Its standard setup gives models unrealistically detailed instructions: exact class names, interface definitions, variable naming conventions. That's not how tasks are written in real companies, but it's a necessary evil, because without precise naming, a model can produce a logically correct solution that fails every unit test simply because it named a class TextCA instead of TextCa.
The opposite extreme would be to give the model a brief, human-like description ("fix the headers in the response") and let it figure out the rest. But then evaluation becomes impossible: you can't write deterministic tests against an unknown interface.
Due to a bug in Harbor's adapter for SWE-Bench Pro, we accidentally stumbled into a middle ground. The adapter leaked fail-to-pass tests from the golden patch, while the task description itself remained minimal. While it wasn't a perfect simulation of a real CI/CD pipeline, it produced a setup remarkably similar to a typical, slightly messy real-world regression scenario: CI catches a failure, the developer gets a short description of the issue plus a stack trace and failing tests, and has to figure out both what's wrong and how to fix it — without a detailed implementation spec.
This ate roughly $20,000 of our $50,000 total compute budget before we caught the bug. But the results were revelatory.
Regression-Style Results (Harbor Bug)
| # | Model | Score (±CI) | Avg Cost | Avg Time |
|---|---|---|---|---|
| 1 | claude-sonnet-4.6 | 78.90% ± 1.61% | $0.69 | 6:55 |
| 2 | claude-opus-4.6 | 76.58% ± 1.54% | $1.29 | 8:33 |
| 3 | claude-opus-4.5 | 66.99% ± 1.88% | $1.25 | 8:09 |
| 4 | gpt-5.3-codex | 63.42% ± 2.23% | $0.70 | 5:20 |
| 5 | gpt-5.4 | 62.33% ± 2.41% | $1.00 | 7:32 |
| 6 | gpt-5.2 | 59.73% ± 2.07% | $1.26 | 13:52 |
| 7 | gemini-3.1-pro (custom tools) | 55.34% ± 2.32% | $0.59 | 13:50 |
| 8 | gemini-3.1-pro | 52.33% ± 2.45% | $0.75 | 16:07 |
| 9 | claude-sonnet-4.5 | 51.37% ± 2.14% | $0.71 | 8:48 |
| 10 | gemini-3-flash | 50.55% ± 2.21% | $0.29 | 8:31 |
| 11 | glm-5 | 50.00% ± 4.12% | $0.53 | 15:45 |
| 12 | gpt-5.2-codex | 48.90% ± 2.09% | $0.84 | 6:54 |
| 13 | gemini-3-pro | 45.89% ± 2.73% | $0.70 | 15:33 |
| 14 | kimi-k2.5 | 43.15% ± 3.21% | $0.29 | 7:50 |
| 15 | minimax-2.5 | 32.60% ± 2.96% | $0.11 | 7:56 |
Completely different picture. Anthropic dominates. Gemini 3 Flash outperforms Gemini 3 Pro. GPT-5.2 outperforms GPT-5.2-Codex. The spread between top and bottom roughly quadruples.
If you go on Twitter and ask developers which models feel best for day-to-day coding, this regression-style table maps far more closely to the popular sentiment than the standard one.
This is one of the key takeaways: people don't always measure what they think they're measuring. With well-specified tasks, all models converge. It's only when you introduce ambiguity (the kind of vague descriptions and red CI signals that actual developers deal with) that the true differences emerge. What matters isn't just the ability to solve a well-defined problem; it's the ability to figure out what the problem is from limited information.
To be clear: the Harbor setup didn't just make the benchmark harder or easier; it changed what the benchmark was measuring. That's why the ranking shifted, not just the scores.
Side-by-Side Comparison
Here are both setups together to clearly show the contrast. The first column shows scores with the standard SWE-Bench Pro instructions (detailed, precise specifications). The second shows scores from our regression-style run (minimal, more human-like descriptions with leaked failing tests).
| Model | Standard Score | Regression-Style Score | Δ |
|---|---|---|---|
| claude-sonnet-4.6 | 50.7% | 78.90% | +28.2 |
| claude-opus-4.6 | 52.7% | 76.58% | +23.9 |
| claude-opus-4.5 | 51.5% | 66.99% | +15.5 |
| gpt-5.3-codex | 49.6% | 63.42% | +13.8 |
| gpt-5.2 | 47.2% | 59.73% | +12.5 |
| gpt-5.4 | 51.3% | 62.33% | +11.0 |
| claude-sonnet-4.5 | 45.5% | 51.37% | +5.9 |
| gemini-3.1-pro (custom tools) | 50.0% | 55.34% | +5.3 |
| gemini-3-flash | 45.9% | 50.55% | +4.7 |
| glm-5 | 47.5% | 50.00% | +2.5 |
| gemini-3.1-pro | 49.9% | 52.33% | +2.4 |
| gpt-5.2-codex | 46.9% | 48.90% | +2.0 |
| kimi-k2.5 | 44.8% | 43.15% | -1.7 |
| gemini-3-pro | 49.6% | 45.89% | -3.7 |
| minimax-2.5 | 41.1% | 32.60% | -8.5 |
The Δ column is the story. Models that handle ambiguity well (Anthropic, GPT-5.2/5.3) see massive score increases in the regression-style setup — they can take a vague description plus failing tests and figure out both the problem and the solution. Models that rely on precise instruction-following (Codex-5.2, Gemini Pro, alternative models) see modest gains or even declines.
A few models actually got worse with leaked tests: Gemini 3 Pro (-3.7), Kimi (-1.7), and Minimax (-8.5). Without detailed instructions, these models struggle to navigate the ambiguity, and the leaked tests don't compensate for the loss of structure. This is especially notable for Minimax, where the gap is dramatic, suggesting that the model heavily depends on explicit specification to perform well.
This maps cleanly to what we observe internally: teams that work in a more exploratory, "vibe coding" style prefer Anthropic models, while teams that write more detailed task specifications prefer the GPT family. Neither group is wrong; they're just operating in different regimes, and the standard benchmarks only measure one of them.
Model-by-Model Analysis
The analysis below is based primarily on the regression-style (Harbor) run, which we believe captures an important dimension of model capability that standard benchmarks miss. Where relevant, we note how things shift in the standard results.
Google Models
The situation with Gemini models is fairly confusing. Everyone has noted Google's breakthrough over the past year, the one that triggered a "code red" at OpenAI, but in the coding domain, Gemini isn't as widely adopted as Codex or Claude Code. My theory is that despite excellent pretraining, Gemini's post-training sometimes leaves something to be desired when it comes to tool use.
Interestingly, back in the 3.0 generation, Gemini Flash, despite being a smaller model, showed better results on automated benchmarks than Pro, precisely because of its different post-training approach. The latest release seems to confirm our hypothesis: Google released two versions of the 3.1 model, one with custom tool use and a standard one. Despite the standard version being the default, the custom-tools version performs better on benchmarks (and is cheaper!). We hope Google has identified the root cause and that subsequent versions will work better out of the box with agentic CLIs.
It's also worth noting that our initial scores were about 15% lower because Gemini CLI was silently crashing (unlike its competitors) on some repositories. Setting up Gemini CLI took us the most time due to these kinds of "silent" bugs: runs would abort at random points, giving the illusion of completion. This doesn't reflect on the model's capabilities, of course, but it certainly hurts the experience.
That said, in terms of price-to-quality ratio, Flash is the clear favorite among all models. At just $0.29 per task on average, it delivered 50.55 ± 2.21%, not far behind the much more expensive Pro models. While you can't throw the hardest tasks at it, Flash performs search tasks better than anything else in its class, making it the best choice for many workloads.
Verdict: Gemini family models absolutely belong in your toolkit, but for now, unfortunately, only in a supporting role. A stable harness around them, like the one Zencoder provides, helps get more out of the model.
Note: In the standard results, all Google models cluster much tighter (45.9%–50.0%), and the Flash-vs-Pro relationship actually flips: Pro scores higher than Flash, reversing the regression-style result. This further supports our thesis that detailed instructions mask real capability differences.
OpenAI Models
OpenAI shows steady, consistent progress across generations. GPT-5.4 is now the strongest OpenAI model on the standard benchmark (51.3%), edging out GPT-5.3-Codex (49.6%). This makes sense: the standard setup rewards raw intelligence and careful reasoning, and 5.4 has more of both.
On the regression-style run, the picture is more nuanced. GPT-5.4 scores 62.3%, a strong result, but slightly behind GPT-5.3-Codex at 63.4%. The likely explanation: Codex models are optimized for the exact loop that the regression regime tests: take failing tests, iterate on code, repeat. GPT-5.4 is smarter but slightly less tuned for that tight feedback cycle. At $1.00 per task, 5.4 is also pricier than Codex ($0.70), though faster than the older GPT-5.2 ($1.26, 13:52) and reasonably quick at 7:32.
The broader trend matters more than any single comparison. From 5.2 to 5.3-Codex to 5.4, OpenAI has been steadily closing the gap: better scores, lower costs, faster execution. Codex remains the fastest model end-to-end on benchmarks like these (5:20 per task for 5.3-Codex), and we haven't even measured its speed using the new Responses API with WebSocket mode!
One behavioral quirk worth noting: OpenAI's Codex line has a tendency to refuse tasks when something "concerns" it. A linter not running, an env not configured. Where other models push through, Codex quietly gives up. This is precisely why GPT-5.2-Codex scored lower than the base GPT-5.2 in the regression-style run (48.9% vs 59.7%): the model simply stopped working on a significant chunk of tasks. With 5.3-Codex this improved but didn't fully go away; for roughly 20% of tasks the model initially produced empty solutions. We had to add an instruction: "Complete the task. If something concerns you, know that it's part of the task, and you still need to complete it."
Ironically, this is exactly how we discovered the Harbor bug in the first place: the model noticed that leaked tests shouldn't be there and refused to proceed. So this cautiousness is arguably a feature, not a bug. But in day-to-day use, it can be genuinely frustrating.
Verdict: OpenAI shows the most consistent improvement curve of the three major vendors: each release is a bit better, a bit cheaper, a bit faster; progress is incremental but very reliable. If you need a reliable workhorse with fast execution, especially in structured pipelines where strict instruction-following is an asset, OpenAI is a strong choice. More broadly, we're glad to see that each generation is getting faster and more token-efficient. In my opinion, speed was one of GPT's biggest weaknesses for a long time. If you tried OpenAI models a while back and gave up because they were too slow, it's worth giving them another look.
Note: In the standard results, all OpenAI models (47.2%–51.3%) cluster tighter than in the regression run (48.9%–63.4%), further confirming that detailed instructions compress model differences.
Anthropic
Anthropic shows impressive results. Curiously, Sonnet 4.6 outperformed Opus 4.6 on our regression-style benchmark: 78.90 ± 1.61% vs. 76.58 ± 1.54%. A resolve rate around 80% essentially means this benchmark has been beaten. And the pricing is quite reasonable: Sonnet at $0.69 per task is on par with GPT-5.3-Codex, while delivering over 15 percentage points more in resolve rate under ambiguity. Even Opus at $1.29 is quite affordable compared to Opus 4.1.
The Sonnet-over-Opus result may seem counterintuitive, but when we analyzed the trajectories, the picture became clearer. At this level, the tasks themselves aren't perfect: Opus tends to overthink and produce solutions that may theoretically be better, but don't pass the specific tests. We're essentially hitting the ceiling of what the benchmark can reliably measure.
End-to-end speed is also at a good level: despite Opus almost certainly being a much larger model than Codex, the difference isn't that noticeable. At 8:33 per task, Opus is quite comparable to Flash (8:31)!
Part of the success on this benchmark is due to the model being wrapped in a better harness. Claude's Opus uses Haiku as a minion for simpler jobs, which allows it to avoid cost blow-ups. OpenAI and Google can both significantly improve the day-to-day performance of their models by investing in stronger harnesses.
Note: In the standard results, the Anthropic advantage compresses dramatically (50.7%–52.7%), and Opus reclaims the lead over Sonnet, which makes sense: with detailed instructions, raw intelligence matters more than the ability to navigate ambiguity. Under ambiguity, Opus tends to overthink in some cases; this happens in the standard benchmark too, but it hurts more when golden tests are available.
Open Weights & Alternative Models
The landscape here presents an interesting dynamic. In the standard, well-specified benchmark, the top open model, GLM-5 (47.5%), actually edges out Gemini 3 Flash (45.9%). However, in the ambiguous regression-style run, Flash reclaims the lead (50.55% vs. GLM-5's 50.00%) while remaining significantly cheaper. Keep in mind that Flash is a highly optimized small model, while GLM-5 is a larger, more expensive one.
That said, technology is a game of "where the puck is going," and this puck is flying hypersonic. The results we see today from models like GLM-5, Kimi k2.5, and Minimax 2.5 (which are now fully open and accessible) would have bested frontier models a year ago, and are getting good enough for some simple day-to-day jobs. The cost structure for these models will continue to drop with better distillation, speculative decoding, and similar techniques. The ability to tune these models, and to have direct access to log probs, also opens use cases that are currently not accessible via frontier APIs.
We are looking forward to the next models from Meta, Mistral, Qwen, and DeepSeek. And we are keeping our fingers crossed for NVIDIA to put their immense scale behind their Groq acquisition to get us ultra-fast inference on these. When these models are distilled into dense ones, and the RAM shortage is over, who knows how far your own Mac Mini can take you.
Under the Hood: Similar Scores, Different Solutions
Wait: if Claude Sonnet 4.6 hits nearly 80% on regression-style tasks, does that mean full autonomy has arrived?
Not at all. Real-world tasks are far messier than any benchmark, and no single model will sustain 80% in production. Some of those shortcomings can be mitigated by orchestration. Let's dig into why. We ran a cross-cutting analysis on all 730 standard tasks and found that similar scores hide radically different solution sets. Models that look interchangeable on a leaderboard are actually solving different subsets of tasks.
Frontier Coverage
Out of 730 tasks, 506 (69.3%) were solved by at least one model, and 224 (30.7%) remain beyond all 15 models — the current frontier ceiling. At the other extreme, 144 tasks (19.7%) were trivially solved by all 15 models. A further 259 differentiating tasks (35.5%) are where models actually separate from each other.
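The coverage breakdown above falls out of simple set operations over per-model solved-task sets. A minimal sketch (the `results` layout is our hypothetical data shape, not the actual analysis code):

```python
def coverage_breakdown(results: dict[str, set[str]], all_tasks: set[str]) -> dict[str, int]:
    """Count tasks solved by at least one model, by none, by all, and by only some."""
    solved_by_any = set().union(*results.values())          # frontier coverage
    solved_by_all = set.intersection(*results.values())     # trivially solved
    return {
        "solved_by_any": len(solved_by_any),
        "unsolved": len(all_tasks - solved_by_any),         # frontier ceiling
        "solved_by_all": len(solved_by_all),
        "solved_by_some": len(solved_by_any - solved_by_all),
    }
```

Here "solved_by_some" counts every task solved by at least one model but not all of them; the differentiating subset discussed above is where that partial coverage actually separates the leaderboard.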
Model Overlap: Who to Pair?
If ensembling matters, the next question is: which models are worth combining? We computed pairwise Jaccard overlap (tasks solved by both / tasks solved by either) across the top models:
- gpt-5.3-codex ↔ gpt-5.4: 84% overlap — nearly redundant (both OpenAI, similar training)
- gemini-3.1-pro ↔ gemini-3-pro: 78% overlap — same vendor, high redundancy
- claude-opus-4.6 ↔ claude-sonnet-4.6: 77% overlap — same vendor, but with meaningful unique solves each
- gemini-3-pro ↔ claude-sonnet-4.6: 68% overlap — most complementary cross-vendor pair
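Jaccard overlap is straightforward to reproduce from the same per-model solved-task sets (again a sketch over a hypothetical `results` dict, not our actual pipeline):

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Tasks solved by both, divided by tasks solved by either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(results: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard overlap for every model pair, highest overlap first."""
    pairs = {(m1, m2): jaccard(results[m1], results[m2])
             for m1, m2 in combinations(sorted(results), 2)}
    return dict(sorted(pairs.items(), key=lambda kv: -kv[1]))
```

A pair with 84% overlap adds little beyond its stronger member; the interesting pairs for ensembling are the low-overlap, high-score ones.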
This has practical implications for pipeline design: if you're picking two models for an ensemble, you get far more out of pairing Anthropic + Google than pairing two OpenAI models. Sonnet 4.6 and GPT-5.4 each contribute irreplaceable solves — they genuinely cover different problem spaces.
The takeaway: if you're building a production pipeline, betting on a single model is leaving performance on the table. Under the hood, each model has blind spots that others cover. A well-designed orchestration layer can push your effective solve rate well beyond what any individual model achieves.
Conclusions
Two major insights emerged from this work.
First, evals are hard to design and easy to misread. When you measure performance on perfectly specified tasks, all models look the same. The real differentiation shows up exactly where benchmarks usually don't look: in the messy, ambiguous, underspecified reality of day-to-day engineering. Our accidental regression-style run showed this vividly: the same frontier models that cluster within a ~6-point band in the standard setup spread across a ~26-point range in the regression-style run.
Second, even when models score similarly, they solve different problems. Our cross-cutting analysis showed that models with nearly identical leaderboard scores have only 68–77% task overlap. This means orchestration isn't just a nice-to-have; it's a direct path to better results.
One of the key takeaways from these evaluations, and one that was unexpected even for me, is that we have already crossed the line where AI has superseded many engineers in its ability to quickly solve small-scale real-world engineering problems. Where humans still own the engineering universe are areas that require broader context and experience, such as system design and architecture, UX, and "peripheral vision" — being able to notice a problem before it becomes a problem. That sort of strategic engineering "chess" is still too complex to train the models on end-to-end.
However, part of that systems thinking can be brute-forced with agentic workflows. Since our cross-cutting analysis showed that models have different blind spots and complementary strengths, relying on a single model leaves performance on the table. A robust orchestration layer is critical.
For example, an AI orchestration platform like Zenflow can utilize a separate Review Agent equipped with a comprehensive rubric to double-check the work of the primary coding models. By asking the right questions, the orchestration layer pre-empts issues just like a strong senior engineer does:
- What are the security implications of this increment, and is there immediate important follow-up work?
- Does this change significantly reduce performance or hinder scalability?
- Is there a way to implement this feature that would be significantly better UX, either tactically for this increment, or maybe overall for the solution?
- Does the implementation follow KISS, DRY, and "low coupling + high cohesion" principles, or does it significantly complicate future extensibility/maintenance/operations?
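As an illustration only (the function names, prompt wording, and the "CONCERN:" convention below are our invention, not the Zenflow API), such a rubric-driven Review Agent can be as simple as a loop over rubric questions:

```python
REVIEW_RUBRIC = [
    "What are the security implications, and is there important follow-up work?",
    "Does this change significantly reduce performance or hinder scalability?",
    "Is there an implementation with significantly better UX?",
    "Does it follow KISS, DRY, and low coupling + high cohesion?",
]

def review_change(diff: str, ask_model) -> list[str]:
    """Run each rubric question past a reviewer model; collect flagged concerns.

    `ask_model` is any callable prompt -> answer; answers starting with
    "CONCERN:" are treated as findings (a made-up convention for this sketch).
    """
    findings = []
    for question in REVIEW_RUBRIC:
        answer = ask_model(
            f"You are a senior engineer reviewing a change.\n"
            f"Diff:\n{diff}\n\n"
            f"Question: {question}\n"
            f"Answer 'OK' or 'CONCERN: <details>'."
        )
        if answer.strip().upper().startswith("CONCERN"):
            findings.append(f"{question} -> {answer.strip()}")
    return findings
```

In a real pipeline, the findings would feed back to the primary coding agent as follow-up tasks rather than surfacing directly to a human.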
This multi-agent orchestration, done right, reduces model blind spots and leverages the unique strengths of different models we discovered in our overlap analysis.
In the next series of posts, we'll share how to use different models effectively and how to build pipelines that leverage their complementary strengths.
PS Important Caveats
- Data contamination. Despite the dataset's design ensuring that tasks shouldn't end up in training data due to licensing, knowledge about these repositories is still present in the models. That said, your agents shouldn't be running blind either. A big part of making AI work for you is making it easier for your AI agents to understand your repository (e.g., writing a good AGENTS.md or orchestrating a multi-agent workflow).
- Task quality. The tasks were verified for correctness and feasibility. Real life is far messier. This is precisely why approaches like Spec-Driven Development (SDD) matter: additional specification and validation steps by different models help catch ambiguities and prevent them from shipping to production.
- Solved ≠ solved optimally. There's a reason every good engineering organization has human code review. Software products are complex engineering systems, and it's easy for humans or AI to miss something, so verification is key.
- Two setups, two trade-offs. The standard setup gives unrealistically detailed instructions but no test leakage — it measures the ability to implement a well-specified task. The regression-style setup gives minimal instructions but leaks failing tests — it measures the ability to diagnose and fix a problem from limited context, much like a real CI regression. Neither is "the right" benchmark. We present both because the contrast reveals that task specification quality is a hidden variable that dramatically affects benchmark outcomes.
This article was written by Andrew Filev and Dmitry Krasnov.